From guido at python.org  Fri Sep  1 00:04:50 2006
From: guido at python.org (Guido van Rossum)
Date: Thu, 31 Aug 2006 15:04:50 -0700
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <1cb725390608311313h4eac0f98x85a0690d3082b533@mail.gmail.com>
References: <20060827184941.1AE8.JCARLSON@uci.edu> <ed1q7r$v4s$2@sea.gmane.org>
	<20060829102307.1B0F.JCARLSON@uci.edu> <ed1uds$iog$1@sea.gmane.org>
	<ed3iq2$9iv$1@sea.gmane.org>
	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>
	<20060831044354.GH6257@performancedrivers.com>
	<44F72E75.2050204@acm.org>
	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
	<1cb725390608311313h4eac0f98x85a0690d3082b533@mail.gmail.com>
Message-ID: <ca471dc20608311504u1d3a804en1a31527d87bacc69@mail.gmail.com>

(Adding back py3k list assuming you just forgot it)

On 8/31/06, Paul Prescod <paul at prescod.net> wrote:
> On 8/31/06, Guido van Rossum <guido at python.org> wrote:
>
> > > (The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes
> > > per character, and doesn't support the supplemental characters above
> > > 0xffff, whereas UTF-16 characters can be either 2 or 4 bytes.)
> >
> > I think we should also support UTF-16, since Java and .NET (and
> > Win32?) appear to be using it effectively; making surrogate handling an
> > application issue doesn't seem *too* big of a burden for many apps.
>
> I think that the reason that UTF-16 seems "not too big of a burden" is
> because people just ignore the UTF-16-ness of the data and hope that nobody
> uses those characters. In effect they trade correctness and
> internationalization for simplicity and performance. It seems like it may
> become a bigger issue as time goes by.

Well, there's a large class of apps that don't do anything for which
surrogates matter, since they just copy strings around and only split
them at specific characters.  E.g. parsing XML would often fall in
this category.

> Plus, it sounds like you're proposing that the encodings of the underlying
> data would leak through to the application. As I understood Fredrik's
> model, the intention was to treat the encoding as an implementation detail.
> If it works well, this could be an important differentiator for Python
> (versus Java) as Unicode already is (versus Ruby).

*Only* for UTF-16, which I consider a necessary evil since we can't
rewrite the Java and .NET standards.

> So my basic feeling is that if we're going to hide UTF-8 from the programmer
> then we might as well go the extra mile and hide UTF-16 as well.

I don't think the issues are the same.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From brett at python.org  Fri Sep  1 00:18:17 2006
From: brett at python.org (Brett Cannon)
Date: Thu, 31 Aug 2006 15:18:17 -0700
Subject: [Python-3000] Exception Expressions
In-Reply-To: <76fd5acf0608311450r6fbddd44n28ab6f83741b8699@mail.gmail.com>
References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
	<bbaeab100608311120v67b23b79p15c2d46fe86cbed9@mail.gmail.com>
	<76fd5acf0608311450r6fbddd44n28ab6f83741b8699@mail.gmail.com>
Message-ID: <bbaeab100608311518g29c1b4a5x38834d4f5582e4f1@mail.gmail.com>

On 8/31/06, Calvin Spealman <ironfroggy at gmail.com> wrote:
>
> On 8/31/06, Brett Cannon <brett at python.org> wrote:
> > So this feels like the Perl idiom of using die: ``open(file) or die``
> (or
> > something like that; I have never been a Perl guy so I could be off).
> >
> > > ...
> >
> > The problem I have with this whole proposal is that catching exceptions
> > should be very obvious in the source code.  This proposal does not help
> with
> > that ideal.  So I am -1 on the whole idea.
> >
> > -Brett
>
> "Ouch" on the associated my idea with perl!


=)  The truth hurts.

> Although I agree that it is good to be obvious about exceptions, there
> are some cases when they are simply less than exceptional. For
> example, you can do d.get(key, default) if you know something is a
> dictionary, but for general mappings you can't rely on that, and may
> often use exceptions as a kind of logic control. No, that doesn't sync
> with the purity of exceptions, but sometimes practicality and
> real-world usage trumps theory.


Practicality most definitely beats purity, but I don't see the practicality
of this over what we already have.

> Only allowing a single expression, it shouldn't be able to get ugly.


Famous last words.  Remember a big argument against the 'if' expressions was
about them getting too unwieldy in terms of length and obscuring the fact
that it is a conditional.  I have used 'if' expressions and they have been
hard to keep readable unless you are willing to use parentheses, which makes
them unreadable in a different way.  I would be afraid of this happening
here, but to an
even more important construct that should always be easy to spot in source
code.

-Brett

From walter at livinglogic.de  Fri Sep  1 00:24:35 2006
From: walter at livinglogic.de (Walter Dörwald)
Date: Fri, 01 Sep 2006 00:24:35 +0200
Subject: [Python-3000] Comment on iostack library
In-Reply-To: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
References: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
Message-ID: <44F761A3.5060009@livinglogic.de>

tomer filiba wrote:

> [...]
> besides, encoding suffers from many issues. suppose you have a
> damaged UTF8 file, which you read char-by-char. when we reach the
> damaged part, you'll never be able to "skip" it, as we'll just keep
> read()ing bytes, hoping to make a character out of it, until we
> reach EOF, i.e.:
> 
> def read_char(self):
>     buf = ""
>     while not self._stream.eof:
>         buf += self._stream.read(1)
>         try:
>             return buf.decode("utf8")
>         except ValueError:
>             pass
> 
> which leads me to the following thought: maybe we should have
> an "enhanced" encoding library for py3k, which would report
> *incomplete* data differently from *invalid* data. today it's just a
> ValueError: suppose decode() would raise IncompleteDataError
> when the given data is not sufficient to be decoded successfully,
> and ValueError when the data is just corrupted.
> 
> that could aid iostack greatly.

We *do* have that functionality in Python 2.5: incremental decoders can
retain incomplete byte sequences on the call to the decode() method
until the next call. Only when final=True is passed in the decode() call
will it treat incomplete and invalid data in the same way: by raising an
exception.

Incomplete input:
>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\xe1")
u''
>>> d.decode("\x88")
u''
>>> d.decode("\xb4")
u'\u1234'

Invalid input:
>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\x80")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256,
in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
unexpected code byte

Incomplete input with final=True:
>>> import codecs
>>> d = codecs.lookup("utf-8").incrementaldecoder()
>>> d.decode("\xe1", final=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256,
in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0:
unexpected end of data

Servus,
   Walter


From greg.ewing at canterbury.ac.nz  Fri Sep  1 04:39:37 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 01 Sep 2006 14:39:37 +1200
Subject: [Python-3000] Exception Expressions
In-Reply-To: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
Message-ID: <44F79D69.6090909@canterbury.ac.nz>

Calvin Spealman wrote:

> Other example use cases:
> 
>     # Fallback on an alternative path
> 
>     # Handle divide-by-zero

or get by with index() instead of find():

    s.index("foo") except -1 if IndexError # :-)

>     open(filename) except open(filename2) if IOError

One problem is that it doesn't seem to chain
all that well. Suppose you had three files to
try opening:

   open(name1) except (open(name2) except open(name3) if IOError) if IOError

Maybe it would be better if the exception type
and alternative expression were swapped over.
Then you could write

   open(name1) except IOError then open(name2) except IOError then open(name3)

Still rather unwieldy though. -0.7j, I think
(the j to acknowledge that this is an imaginary
proposal.:-)
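
In today's Python you can get the chaining with a small helper; a rough
sketch (the name 'attempt' is invented for illustration):

    def attempt(thunk, exc_type, fallback):
        # evaluate thunk(); on the given exception, defer to fallback()
        try:
            return thunk()
        except exc_type:
            return fallback()

    f = attempt(lambda: open(name1), IOError,
                lambda: attempt(lambda: open(name2), IOError,
                                lambda: open(name3)))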

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From david.nospam.hopwood at blueyonder.co.uk  Fri Sep  1 04:53:21 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 01 Sep 2006 03:53:21 +0100
Subject: [Python-3000] Comment on iostack library
In-Reply-To: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
References: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
Message-ID: <44F7A0A1.30300@blueyonder.co.uk>

tomer filiba wrote:
> [Talin]
> 
>>Well, as far as readline goes: In order to split the text into lines,
>>you have to decode the text first anyway, which is a layer 3 operation.
>>You can't just read bytes until you get a \n, because the file you are
>>reading might be encoded in UCS2 or something.
> 
> well, the LineBufferedLayer can be "configured" to split on any
> "marker", i.e.: LineBufferedLayer(stream, marker = "\x00\x0a")
> and of course layer 3, which creates layer 2, can set this marker
> to any byte sequence. note it's a *byte* sequence, not chars,
> since this passes down to layer 1 transparently.

That isn't what is required; for big-endian UCS-2 or UTF-16, "\x00\x0a"
should only be recognized as LF if it is at an even byte position.
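
A sketch of the extra check that implies (assuming big-endian UTF-16
bytes in 'data'; the helper name is invented):

    def find_utf16be_lf(data, start=0):
        # "\x00\x0a" only encodes LF when it starts at an even offset
        i = data.find("\x00\x0a", start)
        while i != -1 and i % 2 != 0:
            i = data.find("\x00\x0a", i + 1)
        return i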

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From talin at acm.org  Fri Sep  1 05:13:27 2006
From: talin at acm.org (Talin)
Date: Thu, 31 Aug 2006 20:13:27 -0700
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
References: <20060827184941.1AE8.JCARLSON@uci.edu>
	<ed1q7r$v4s$2@sea.gmane.org>	
	<20060829102307.1B0F.JCARLSON@uci.edu>
	<ed1uds$iog$1@sea.gmane.org>	 <ed3iq2$9iv$1@sea.gmane.org>	
	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>	
	<20060831044354.GH6257@performancedrivers.com>	
	<44F72E75.2050204@acm.org>
	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
Message-ID: <44F7A557.2010002@acm.org>

Guido van Rossum wrote:
> On 8/31/06, Talin <talin at acm.org> wrote:
>> One way to handle this efficiently would be to only support the
>> encodings which have a constant character size: ASCII, Latin-1, UCS-2
>> and UTF-32. In other words, if the content of your text is plain ASCII,
>> use an 8-bit-per-character string; If the content is limited to the
>> Unicode BMF (Basic Multilingual Plane) use UCS-2; And if you are using
>> Unicode supplementary characters, use UTF-32.
>>
>> (The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes
>> per character, and doesn't support the supplemental characters above
>> 0xffff, whereas UTF-16 characters can be either 2 or 4 bytes.)
> 
> I think we should also support UTF-16, since Java and .NET (and
> Win32?) appear to be using it effectively; making surrogate handling an
> application issue doesn't seem *too* big of a burden for many apps.

I see that I misspoke - what I meant was that we would "support" all 
of the available encodings in the sense that we could translate string 
objects to and from those encodings. But the internal representations of 
the string objects themselves would only use those encodings which 
represented a character in a fixed number of bytes.

Moreover, this internal representation should be opaque to users of the 
string - if you want to write out a string as UTF-8 to a file, go for 
it, it shouldn't matter what the internal type of the string is.

(Although Jython and IronPython should probably use whatever string 
representation is defined by the underlying VM.)

>> By avoiding UTF-8, UTF-16 and other variable-character-length formats,
>> you can always ensure that character index operations are done in
>> constant time. Index operations would simply require scaling the index
>> by the character size, rather than having to scan through the string and
>> count characters.
>>
>> The drawback of this method is that you may be forced to transform the
>> entire string into a wider encoding if you add a single character that
>> won't fit into the current encoding.
> 
> A way to handle UTF-8 strings and other variable-length encodings
> would be to maintain a small cache of index positions with the string
> object.

Actually, I realized that this drawback isn't really much of an issue at 
all. For virtually all string operations in Python, it is possible to 
predict ahead of time what string width will be required - thus you can 
allocate the proper width object up front, and not have to "widen" the 
string in mid-operation.

So for example, any string operation which produces a subset of the 
string (such as partition, split, index, slice, etc.) will produce a 
string of the same width as the original string.

Any string operation that involves combining two strings will produce a 
string that is the same type as the wider of the two strings. Thus, if I 
say something like:

    "Hello World" + chr( 0x8000 )

This will produce a 16-bit-wide string, because 'chr( 0x8000 )' can't 
be represented in ASCII and thus produces a 16-bit-wide string on its 
own. Since the first string is plain ASCII (8 bits) and the second is 
16 bits, the result of the concatenation is a 16-bit string.
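
A sketch of that promotion rule (helper names invented, not a concrete
API proposal):

    def char_width(s):
        # narrowest fixed width that can hold every code point in s
        m = max(map(ord, s)) if s else 0
        return 1 if m < 0x100 else (2 if m < 0x10000 else 4)

    def concat_width(a, b):
        # combining two strings yields the wider of the two widths
        return max(char_width(a), char_width(b))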

Similarly, transformations on strings such as upper / lower yield a 
string that is the same width as the original.

The only case I can think of where you might need to "promote" an entire 
string is where you are concatenating to a string buffer, in other words 
you are dealing with a mutable string type. And this case is easily 
handled by simply making the mutable string buffer type always use 
UTF-32, and then narrowing the result when str() is called to the 
narrowest possible representation that can hold the result.

So essentially what I am proposing is this:

-- That the Python 3000 "str" type can consist of 8-bit, 16-bit, or 
32-bit characters, where all characters within a string are the same 
number of bytes.

-- That all 3 types of strings appear identical to Python programmers, 
such that they need not know what type of string they are using.

-- Any operation that returns a string result has the responsibility to 
ensure that the resulting string is wide enough to contain all of the 
characters produced by the operation.

-- That string index operations will always be constant time, with no 
auxiliary data structures required.

-- That all 3 string types can be converted into all of the available 
encodings, including variable-character-width formats, however the 
result is a "bytes" object, not a string.

An additional, but separate part of the proposal is that for str 
objects, the contents of the string are always defined in terms of 
Unicode code points. So if you want to convert to ISO-Latin-1, you can, 
but the result is a bytes object, not a string. The advantage of this is 
that it means that you always know what the value of 'ord()' is for a 
given character. It also means that two strings can always be compared 
for equality without having to decode them first.
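
As a sketch of the intended semantics (illustration only, not working
code in today's Python):

    s = "Hello " + chr(0x8000)   # promoted to a 16-bit-wide str
    b = s.encode("utf-8")        # encoding always yields a bytes object
    assert ord(s[6]) == 0x8000   # ord() is always a Unicode code point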

>> (Another option is to simply make all strings UTF-32 -- which is not
>> that unreasonable, considering that text strings normally make up only a
>> small fraction of a program's memory footprint. I am sure that there are
>> applications that don't conform to this generalization, however. )
> 
> Here you are effectively voting against polymorphic strings. I believe
> Fredrik has good reasons to doubt this assertion.

Yes, that is correct. I'm just throwing it out there as a possibility, 
as it is by far the simplest solution. It's a question of trading memory 
use for simplicity of implementation. Having a single, flat, internal 
representation for all strings would be much less complex than having 
different string types.

-- Talin

From ironfroggy at gmail.com  Fri Sep  1 05:21:08 2006
From: ironfroggy at gmail.com (Calvin Spealman)
Date: Thu, 31 Aug 2006 23:21:08 -0400
Subject: [Python-3000] Exception Expressions
In-Reply-To: <44F79D69.6090909@canterbury.ac.nz>
References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
	<44F79D69.6090909@canterbury.ac.nz>
Message-ID: <76fd5acf0608312021w1e0cf0f3md00ee5232f3ef9f4@mail.gmail.com>

On 8/31/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> One problem is that it doesn't seem to chain
> all that well. Suppose you had three files to
> try opening:
>
>    open(name1) except (open(name2) except open(name3) if IOError) if IOError
>
> Maybe it would be better if the exception type
> and alternative expression were swapped over.
> Then you could write
>
>    open(name1) except IOError then open(name2) except IOError then open(name3)
>
> Still rather unwieldy though. -0.7j, I think
> (the j to acknowledge that this is an imaginary
> proposal.:-)
>
> --
> Greg Ewing, Computer Science Dept, +--------------------------------------+
> University of Canterbury,          | Carpe post meridiem!                 |
> Christchurch, New Zealand          | (I'm not a morning person.)          |
> greg.ewing at canterbury.ac.nz        +--------------------------------------+

I considered the expr1 except exc_type then expr2 syntax, but it adds
a keyword without much need to do so. But, I suppose that isn't a
problem now that conditional expressions are in and then is already a
keyword.

I hereby upgrade this from imaginary proposal to real proposal status!

From paul at prescod.net  Fri Sep  1 05:32:32 2006
From: paul at prescod.net (Paul Prescod)
Date: Thu, 31 Aug 2006 20:32:32 -0700
Subject: [Python-3000] UTF-16
Message-ID: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>

On 8/31/06, Guido van Rossum <guido at python.org> wrote:
>
> (Adding back py3k list assuming you just forgot it)


Yes, thanks. Gmail's UI really optimizes the "Reply To" operation over "Reply
To All".

> Plus, it sounds like you're proposing that the encodings of the underlying
> > data would leak through to the application. As I understood Fredrick's
> > model, the intention was to treat the encoding as an implementation
> detail.
> > If it works well, this could be an important differentiator for Python
> > (versus Java) as Unicode already is (versus Ruby).
>
> *Only* for UTF-16, which I consider a necessary evil since we can't
> rewrite the Java and .NET standards.


I see what you're getting at.

I'd say that decoding UTF-16 data in CPython and PyPy should (by default)
create true Unicode characters. Jython and IronPython could create
surrogates and characters when necessary. When you run the program in
CPython you'll get better behaviour than in Jython/IronPython. Maybe there
could be a way to make CPython run like Jython and IronPython if you wanted
100% absolute compatibility between the environments. I think that we agree
that it would be unfortunate if CPython copied Java and .NET to its own
detriment. It's also not inconceivable that Java and .NET might evolve a
4-byte mode in the long term.

 Paul Prescod

From guido at python.org  Fri Sep  1 05:46:55 2006
From: guido at python.org (Guido van Rossum)
Date: Thu, 31 Aug 2006 20:46:55 -0700
Subject: [Python-3000] UTF-16
In-Reply-To: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>
Message-ID: <ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>

On 8/31/06, Paul Prescod <paul at prescod.net> wrote:
> On 8/31/06, Guido van Rossum <guido at python.org> wrote:
> > (Adding back py3k list assuming you just forgot it)
>
> Yes, thanks. Gmail's UI really optimizes the "Reply To" operation of "Reply
> To All."
>
> > > Plus, it sounds like you're proposing that the encodings of the
> underlying
> > > data would leak through to the application. As I understood Fredrick's
> > > model, the intention was to treat the encoding as an implementation
> detail.
> > > If it works well, this could be an important differentiator for Python
> > > (versus Java) as Unicode already is (versus Ruby).
> >
> > *Only* for UTF-16, which I consider a necessary evil since we can't
> > rewrite the Java and .NET standards.
>
> I see what you're getting at.
>
> I'd say that decoding UTF-16 data in CPython and PyPy should (by default)
> create true Unicode characters. Jython and IronPython could create
> surrogates and characters when necessary. When you run the program in
> CPython you'll get better behaviour than in Jython/IronPython. Maybe there
> could be a way to make CPython run like Jython and IronPython if you wanted
> 100% absolute compatibility between the environments. I think that we agree
> that it would be unfortunate if CPython copied Java and .NET to its own
> detriment. It's also not inconceivable that Java and .NET might evolve a
> 4-byte mode in the long term.

I think it would be best to do this as a CPython configuration option
just like it's done today. You can choose 4-byte or 2-byte Unicode
(essentially UCS-4 or UTF-16) in order to be compatible with other
packages on the platform. Yes, 4-byte gives better Unicode support.
But 2-bytes may be more compatible with other stuff on the platform.
Too bad .NET and Java don't have this option. :-)
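
(The build-time choice is visible today through sys.maxunicode:)

    >>> import sys
    >>> sys.maxunicode  # 65535 on a 2-byte build, 1114111 on a 4-byte build
    1114111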

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Fri Sep  1 06:13:29 2006
From: guido at python.org (Guido van Rossum)
Date: Thu, 31 Aug 2006 21:13:29 -0700
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <44F7A557.2010002@acm.org>
References: <20060827184941.1AE8.JCARLSON@uci.edu> <ed1q7r$v4s$2@sea.gmane.org>
	<20060829102307.1B0F.JCARLSON@uci.edu> <ed1uds$iog$1@sea.gmane.org>
	<ed3iq2$9iv$1@sea.gmane.org>
	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>
	<20060831044354.GH6257@performancedrivers.com>
	<44F72E75.2050204@acm.org>
	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
	<44F7A557.2010002@acm.org>
Message-ID: <ca471dc20608312113i74c7cc98t79255b62aeb22816@mail.gmail.com>

On 8/31/06, Talin <talin at acm.org> wrote:
> > Here you are effectively voting against polymorphic strings. I believe
> > Fredrik has good reasons to doubt this assertion.
>
> Yes, that is correct. I'm just throwing it out there as a possibility,
> as it is by far the simplest solution. It's a question of trading memory
> use for simplicity of implementation. Having a single, flat, internal
> representation for all strings would be much less complex than having
> different string types.

I think you don't realize the significance of the immediate
enthusiastic +1 votes from several OSX developers.

These people are quite familiar with ObjectiveC. ObjectiveC has true
polymorphic strings, and the internal representation *can* be UTF-8.
These developers love that.

For most practical purposes the internal representation is abstracted
away from the application; *however* it is possible to go below this
level, especially for I/O (I believe). The net effect, if I understand
correctly, is that you can save yourself a lot of copying if you are
mostly just moving whole strings around and doing relatively little
slicing and dicing -- it avoids converting from UTF-8 (which is by far
the most common external representation) to UCS-2 or UCS-4 and back
again.

I don't think these advantages are maintained by your "narrowest
constant-width encoding that fits all the characters" proposal.

I'm not saying that we should definitely adopt this -- it may well be
that the ObjectiveC string API is significantly different from
Python's (e.g. it could have less emphasis on character indices and
character counts) so that the benefits would be lost in translation --
but I'm not sure that the added complexity of your proposal is
warranted if it still requires encoding and decoding on most I/O
operations.

BTW, in some sense Python 2.x *has* polymorphic strings -- str and
unicode have the same API (99% anyway) but different implementations,
and there's even a common abstract base class (basestring). But this
clearly isn't what the ObjectiveC folks want to see!

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From paul at prescod.net  Fri Sep  1 06:24:19 2006
From: paul at prescod.net (Paul Prescod)
Date: Thu, 31 Aug 2006 21:24:19 -0700
Subject: [Python-3000] UTF-16
In-Reply-To: <ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>
	<ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>
Message-ID: <1cb725390608312124u24d20ec2q27dbe5a69c2440d3@mail.gmail.com>

On 8/31/06, Guido van Rossum <guido at python.org> wrote:
>
> On 8/31/06, Paul Prescod <paul at prescod.net> wrote:
> > On 8/31/06, Guido van Rossum <guido at python.org> wrote:
> > > (Adding back py3k list assuming you just forgot it)
> >
> > Yes, thanks. Gmail's UI really optimizes the "Reply To" operation of
> "Reply
> > To All."
> >
> > > > Plus, it sounds like you're proposing that the encodings of the
> > underlying
> > > > data would leak through to the application. As I understood
> Fredrick's
> > > > model, the intention was to treat the encoding as an implementation
> > detail.
> > > > If it works well, this could be an important differentiator for
> Python
> > > > (versus Java) as Unicode already is (versus Ruby).
> > >
> > > *Only* for UTF-16, which I consider a necessary evil since we can't
> > > rewrite the Java and .NET standards.
> >
> > I see what you're getting at.
> >
> > I'd say that decoding UTF-16 data in CPython and PyPy should (by
> default)
> > create true Unicode characters. Jython and IronPython could create
> > surrogates and characters when necessary. When you run the program in
> > CPython you'll get better behaviour than in Jython/IronPython. Maybe
> there
> > could be a way to make CPython run like Jython and IronPython if you
> wanted
> > 100% absolute compatibility between the environments. I think that we
> agree
> > that it would be unfortunate if CPython copied Java and .NET to its own
> > detriment. It's also not inconceivable that Java and .NET might evolve a
> > 4-byte mode in the long term.
>
> I think it would be best to do this as a CPython configuration option
> just like it's done today. You can choose 4-byte or 2-byte Unicode
> (essentially UCS-4 or UTF-16) in order to be compatible with other
> packages on the platform. Yes, 4-byte gives better Unicode support.
> But 2-bytes may be more compatible with other stuff on the platform.
> Too bad .NET and Java don't have this option. :-)


The current model is a hack (and I wrote the PEP!).

If you decide to go to all of the effort and expense of polymorphic strings,
I cannot understand why a user should be forced to choose between 16 and 32
bit strings AT BUILD TIME. PEP 261 says the reason for the build-time
solution is:

"[The alternate solutions] ... would require a much more
complex implementation than the accepted solution. ...
Guido is not willing to undertake the implementation right
now. ...This PEP represents least-effort solution."

Fair enough. A world of finite resources. But I would be very annoyed if my
ISP had installed a Python version that could magically handle 8-bit and
16-bit strings efficiently but I had to ask them to install a special
version to handle 32 bit strings at all. Obviously build-time configuration
is the least flexible of all available options.

 Paul Prescod

From fredrik at pythonware.com  Fri Sep  1 07:57:06 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 01 Sep 2006 07:57:06 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <44F7A557.2010002@acm.org>
References: <20060827184941.1AE8.JCARLSON@uci.edu>	<ed1q7r$v4s$2@sea.gmane.org>		<20060829102307.1B0F.JCARLSON@uci.edu>	<ed1uds$iog$1@sea.gmane.org>	
	<ed3iq2$9iv$1@sea.gmane.org>		<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>		<20060831044354.GH6257@performancedrivers.com>		<44F72E75.2050204@acm.org>	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
	<44F7A557.2010002@acm.org>
Message-ID: <ed8i3i$at0$1@sea.gmane.org>

Talin wrote:

> So essentially what I am proposing is this:

"look at me! look at me!"

</F>


From fredrik at pythonware.com  Fri Sep  1 08:05:18 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 01 Sep 2006 08:05:18 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <ca471dc20608312113i74c7cc98t79255b62aeb22816@mail.gmail.com>
References: <20060827184941.1AE8.JCARLSON@uci.edu>
	<ed1q7r$v4s$2@sea.gmane.org>	<20060829102307.1B0F.JCARLSON@uci.edu>
	<ed1uds$iog$1@sea.gmane.org>	<ed3iq2$9iv$1@sea.gmane.org>	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>	<20060831044354.GH6257@performancedrivers.com>	<44F72E75.2050204@acm.org>	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>	<44F7A557.2010002@acm.org>
	<ca471dc20608312113i74c7cc98t79255b62aeb22816@mail.gmail.com>
Message-ID: <ed8iiu$cba$1@sea.gmane.org>

Guido van Rossum wrote:

> BTW, in some sense Python 2.x *has* polymorphic strings -- str and
> unicde have the same API (99% anyway) but different implementations,
> and there's even a common abstract base class (basestring). But this
> clearly isn't what the ObjectiveC folks want to see!

on the Python level, absolutely.  the "use 8-bit strings for ASCII, 
Unicode strings for everything else" approach works perfectly well.
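
(for example, in 2.X:)

    >>> "hello" == u"hello"    # 8-bit ascii compares equal to unicode
    True
    >>> "hello, " + u"world"   # and coerces on concatenation
    u'hello, world'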

I'm still a bit worried about C API complexities, but as I mentioned, in 
today's Python, only 8-bit strings are really simple.  and there are 
standard ways to deal with backing stores; if that's good enough for 
apple hackers, it should be good enough for pythoneers.

most of this can be prototyped and benchmarked under 2.X, and parts of 
it can be directly useful also for 2.X developers; I think I'll start 
tinkering.

> These people are quite familiar with ObjectiveC. ObjectiveC has true
> polymorphic strings, and the internal representation *can* be UTF-8.
> These developers love that.

you are aware that Objective C does provide B-tree strings under the 
hood too, I hope ;-)

</F>


From fredrik at pythonware.com  Fri Sep  1 08:22:54 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 01 Sep 2006 08:22:54 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <ed7iii$psn$1@sea.gmane.org>
References: <20060827184941.1AE8.JCARLSON@uci.edu>	<ed1q7r$v4s$2@sea.gmane.org><20060829102307.1B0F.JCARLSON@uci.edu>	<ed1uds$iog$1@sea.gmane.org><ed3iq2$9iv$1@sea.gmane.org>	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>
	<ed7iii$psn$1@sea.gmane.org>
Message-ID: <ed8jju$e52$1@sea.gmane.org>

tjreedy wrote:

> These two similar features would be enough, to me, to make Py3 more than 
> just 2.x with cruft removed.

well, it's really only C API issues that keep us from implementing this 
in 2.x... (too much code uses PyString_Check and/or PyUnicode_Check and 
then happily digs into the associated buffers).

</F>


From fredrik at pythonware.com  Fri Sep  1 08:46:23 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 01 Sep 2006 08:46:23 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
References: <20060827184941.1AE8.JCARLSON@uci.edu>
	<ed1q7r$v4s$2@sea.gmane.org>	<20060829102307.1B0F.JCARLSON@uci.edu>
	<ed1uds$iog$1@sea.gmane.org>	<ed3iq2$9iv$1@sea.gmane.org>	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>	<20060831044354.GH6257@performancedrivers.com>	<44F72E75.2050204@acm.org>
	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
Message-ID: <ed8kvv$j02$1@sea.gmane.org>

Guido van Rossum wrote:

> A way to handle UTF-8 strings and other variable-length encodings
> would be to maintain a small cache of index positions with the string
> object.

I think just delaying decoding would take us most of the way.  the big 
advantage of storage polymorphism is that you can avoid decoding and 
encoding (and having to pay for the cycles and bytes needed for that) if 
you don't have to.  the XML case you mentioned is a typical example; 
just compare the behaviour of a library that does some extra work to 
keep things small under the hood with more straightforward implementations:

     http://effbot.org/zone/celementtree.htm#benchmarks

(cElementTree uses the "8-bit ascii mixes well with unicode" approach)
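
a minimal sketch of what delayed decoding could look like (names
invented, nothing like a real proposal):

    class lazytext:
        # hold raw bytes; decode only when character data is needed
        def __init__(self, raw, encoding="utf-8"):
            self.raw = raw
            self.encoding = encoding
            self._text = None
        def text(self):
            if self._text is None:
                self._text = self.raw.decode(self.encoding)
            return self._text
        def startswith(self, prefix):
            # a complete utf-8 prefix can be tested at the byte level
            return self.raw.startswith(prefix.encode(self.encoding))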

there are plenty of optimizations you can do when accessing the 
beginning and end of a string (startswith, endswith, comparisons, 
slicing, etc), but I think we can deal with that when we get there.
I think the NFS sprint showed that you get better results by working 
with real use cases, rather than spending that theorizing.  it also 
showed that the bottlenecks aren't always where you think they are.

</F>


From fredrik at pythonware.com  Fri Sep  1 08:49:38 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 01 Sep 2006 08:49:38 +0200
Subject: [Python-3000] UTF-16
In-Reply-To: <ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>
	<ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>
Message-ID: <ed8l62$j02$2@sea.gmane.org>

Guido van Rossum wrote:

> I think it would be best to do this as a CPython configuration option
> just like it's done today. You can choose 4-byte or 2-byte Unicode
> (essentially UCS-4 or UTF-16) in order to be compatible with other
> packages on the platform. Yes, 4-byte gives better Unicode support.
> But 2-bytes may be more compatible with other stuff on the platform.
> Too bad .NET and Java don't have this option. :-)

the UCS2/UCS4 linking problem is a minor pain in the ass, though. 
maybe this is best done via a run-time setting?

</F>


From fredrik at pythonware.com  Fri Sep  1 09:56:52 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 1 Sep 2006 09:56:52 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
References: <20060827184941.1AE8.JCARLSON@uci.edu><ed1q7r$v4s$2@sea.gmane.org>	<20060829102307.1B0F.JCARLSON@uci.edu>	<ed1uds$iog$1@sea.gmane.org><ed3iq2$9iv$1@sea.gmane.org>	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com><20060831044354.GH6257@performancedrivers.com>
	<44F72E75.2050204@acm.org>
Message-ID: <ed8p44$v5a$1@sea.gmane.org>

Talin wrote:

> (Another option is to simply make all strings UTF-32 -- which is not
> that unreasonable, considering that text strings normally make up only a
> small fraction of a program's memory footprint. I am sure that there are
> applications that don't conform to this generalization, however. )

performance is more than just memory use, though.  for some string operations,
memory bandwidth is the bottleneck, not memory use.  it simply takes more time
to process four times as much data.

(running the stringbench.py script in the sandbox on a recent 2.5 should give you
some idea of this)

</F> 




From fredrik at pythonware.com  Fri Sep  1 10:01:45 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 1 Sep 2006 10:01:45 +0200
Subject: [Python-3000] locale-aware strings ?
Message-ID: <ed8pd9$ch$1@sea.gmane.org>

today's Python supports "locale aware" 8-bit strings; e.g.

    >>> import locale
    >>> "åäö".isalpha()
    False
    >>> locale.setlocale(locale.LC_ALL, "sv_SE")
    'sv_SE'
    >>> "åäö".isalpha()
    True

to what extent should this be supported by Python 3000?

</F> 




From tomerfiliba at gmail.com  Fri Sep  1 10:05:10 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Fri, 1 Sep 2006 10:05:10 +0200
Subject: [Python-3000] Comment on iostack library
In-Reply-To: <44F761A3.5060009@livinglogic.de>
References: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
	<44F761A3.5060009@livinglogic.de>
Message-ID: <1d85506f0609010105n69e8cdcbw989f861e05ca7a24@mail.gmail.com>

very well, i'll use it. thanks.

On 9/1/06, Walter Dörwald <walter at livinglogic.de> wrote:
> tomer filiba wrote:
>
> > [...]
> > besides, encoding suffers from many issues. suppose you have a
> > damaged UTF8 file, which you read char-by-char. when we reach the
> > damaged part, you'll never be able to "skip" it, as we'll just keep
> > read()ing bytes, hoping to make a character out of it, until we
> > reach EOF, i.e.:
> >
> > def read_char(self):
> >     buf = ""
> >     while not self._stream.eof:
> >         buf += self._stream.read(1)
> >         try:
> >             return buf.decode("utf8")
> >         except ValueError:
> >             pass
> >
> > which leads me to the following thought: maybe we should have
> > an "enhanced" encoding library for py3k, which would report
> > *incomplete* data differently from *invalid* data. today it's just a
> > ValueError: suppose decode() would raise IncompleteDataError
> > when the given data is not sufficient to be decoded successfully,
> > and ValueError when the data is just corrupted.
> >
> > that could aid iostack greatly.
>
> We *do* have that functionality in Python 2.5: incremental decoders can
> retain incomplete byte sequences on the call to the decode() method
> until the next call. Only when final=True is passed in the decode() call
> will it treat incomplete and invalid data in the same way: by raising an
> exception.
>
> Incomplete input:
> >>> import codecs
> >>> d = codecs.lookup("utf-8").incrementaldecoder()
> >>> d.decode("\xe1")
> u''
> >>> d.decode("\x88")
> u''
> >>> d.decode("\xb4")
> u'\u1234'
>
> Invalid input:
> >>> import codecs
> >>> d = codecs.lookup("utf-8").incrementaldecoder()
> >>> d.decode("\x80")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256,
> in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
> unexpected code byte
>
> Incomplete input with final=True:
> >>> import codecs
> >>> d = codecs.lookup("utf-8").incrementaldecoder()
> >>> d.decode("\xe1", final=True)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/var/home/walter/checkouts/Python/test/Lib/codecs.py", line 256,
> in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0:
> unexpected end of data
>
> Servus,
>    Walter
>
>

From fredrik at pythonware.com  Fri Sep  1 13:14:13 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 1 Sep 2006 13:14:13 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
References: <20060827184941.1AE8.JCARLSON@uci.edu><ed1q7r$v4s$2@sea.gmane.org>	<20060829102307.1B0F.JCARLSON@uci.edu><ed1uds$iog$1@sea.gmane.org>	<ed3iq2$9iv$1@sea.gmane.org>	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>	<20060831044354.GH6257@performancedrivers.com>	<44F72E75.2050204@acm.org><ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
	<ed8kvv$j02$1@sea.gmane.org>
Message-ID: <ed94m6$7na$1@sea.gmane.org>

> spending that theorizing.

make that "spending that time theorizing about what you could, in theory, do."

</F> 




From qrczak at knm.org.pl  Fri Sep  1 13:34:42 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 01 Sep 2006 13:34:42 +0200
Subject: [Python-3000] Comment on iostack library
In-Reply-To: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
	(tomer filiba's message of "Thu, 31 Aug 2006 23:43:44 +0200")
References: <1d85506f0608311443s108822c1n31682ba765b2f3e0@mail.gmail.com>
Message-ID: <87u03r6crx.fsf@qrnik.zagroda>

"tomer filiba" <tomerfiliba at gmail.com> writes:

>> Encoding conversion and newline conversion should be performed a
>> block at a time, below buffering, so not only I/O syscalls, but
>> also invocations of the recoding machinery are amortized by
>> buffering.
>
> you have a good point, which i also stumbled upon when implementing
> the TextInterface. but how would you suggest to solve it?

I've designed and implemented this for my language, but I'm not sure
that you will like it because it's quite different from the Python
tradition.

The interface of block reading appends data to the end of the supplied
buffer, up to the specified size (or infinity), and also it tells
whether it reached end of data. The interface of block writing removes
data from the beginning of the supplied buffer, up to the supplied
size (or the whole buffer), and is told how to flush, which includes
information about whether this is the end of data. Both functions are
allowed to read/write less than requested.

The recoding engine moves data from the beginning of an input buffer
to the end of an output buffer. The block recoding function has
similar size parameters as above, and a flushing parameter. It returns
True on output overflow, i.e. when it stopped because it needs more
room in the output rather than because it needs more input. It leaves
unconverted data at the end of the input buffer if data looks incomplete,
unless it is told that this is the last block - in this case it fails.

Both decoding input streams and encoding output streams have a
persistent buffer in the format corresponding to their low end,
i.e. a byte buffer when this is the boundary between bytes and
characters.

This design allows everything to be plugged together, including the cases
where recoding changes sizes significantly (compression/decompression).

It also allows the reading/writing process to be interrupted without
breaking the consistency of the state of buffers, as long as each
primitive reading/writing operation is atomic, i.e. anything it
removes from the input buffer is converted and put in the output
buffer. Data not yet processed by the remaining layers remains in
their respective buffers.

For example reading a block from a decoding stream:
1. If there was no overflow previously, read more data from the
   underlying stream to the internal buffer, up to the supplied
   maximum size.
2. Decode data from the internal buffer to the supplied output buffer,
   up to the supplied maximum size. Tell the recoding engine that this
   is the last piece if there was no overflow previously and reading
   from the underlying stream reached the end.
3. Return True (i.e. end of input) if there was no overflow now and
   reading from the underlying stream reached the end.

Writing a block to an encoding stream is simpler:
1. Encode data from the supplied input buffer to the internal buffer.
2. Write data from the internal buffer to the output stream.

Buffered streams are typically put on the top of the stack. They
support reading a line at a time, unlimited lookahead and unlimited
unreading, and writing which guarantees that it won't leave anything
in the buffer it is writing from.

Newlines are converted by a separate layer. The buffered stream
assumes "\n" endings.
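
A rough sketch of the read path described above (all names are
invented):

    def read_block(self, out_buf, max_size):
        # 1. refill the internal byte buffer, unless the last decode
        #    stopped for lack of output room rather than input
        if not self.overflow:
            self.buf += self.underlying.read(max_size)
        final = not self.overflow and self.underlying.eof
        # 2. decode into the caller's buffer; flag this as the last
        #    piece only when the underlying stream is exhausted
        self.overflow = self.decoder.decode(self.buf, out_buf,
                                            max_size, final)
        # 3. end of input only when nothing is pending anywhere
        return final and not self.overflow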

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From fredrik at pythonware.com  Fri Sep  1 13:41:00 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 1 Sep 2006 13:41:00 +0200
Subject: [Python-3000] string C API
Message-ID: <ed968c$d03$1@sea.gmane.org>

just noticed that PEP 3100 says that PyString_AsEncodedString and
PyString_AsDecodedString are to be removed, but it doesn't mention
any other PyString (or PyUnicode) functions.

how large are the changes we can make here, really?

(I'm not going to sketch out a concrete proposal here; I'm more interested
in general guidelines.  the details are best fleshed out in code)

</F> 




From barry at python.org  Fri Sep  1 14:14:46 2006
From: barry at python.org (Barry Warsaw)
Date: Fri, 1 Sep 2006 08:14:46 -0400
Subject: [Python-3000] UTF-16
In-Reply-To: <ed8l62$j02$2@sea.gmane.org>
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>
	<ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>
	<ed8l62$j02$2@sea.gmane.org>
Message-ID: <188FAEC3-875D-4AA8-8C66-A1DF6F8A96C6@python.org>


On Sep 1, 2006, at 2:49 AM, Fredrik Lundh wrote:

> Guido van Rossum wrote:
>
>> I think it would be best to do this as a CPython configuration option
>> just like it's done today. You can choose 4-byte or 2-byte Unicode
>> (essentially UCS-4 or UTF-16) in order to be compatible with other
>> packages on the platform. Yes, 4-byte gives better Unicode support.
>> But 2-bytes may be more compatible with other stuff on the platform.
>> Too bad .NET and Java don't have this option. :-)
>
> the UCS2/UCS4 linking problems is a minor pain in the ass, though.
> maybe this is best done via a run-time setting?

Yes, the linking problem does crop up from time to time.  Recent  
example: Gentoo Linux is heavily dependent on Python and I recently  
emerged in several packages.  I don't remember the exact details, but  
there was a conflict between UCS2 and UCS4 where two different  
upstream packages required two different linkages, and the wrapping  
Python modules were thus incompatible.  I basically had to decide  
which one I cared about most and delete the other to resolve the  
conflict.  The problem was confusing the hell out of several  
Gentooers until we tracked down all the resources and figured out the  
(suboptimal) fix.

-Barry


From fredrik at pythonware.com  Fri Sep  1 14:23:10 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 1 Sep 2006 14:23:10 +0200
Subject: [Python-3000] UTF-16
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com><ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com><ed8l62$j02$2@sea.gmane.org>
	<188FAEC3-875D-4AA8-8C66-A1DF6F8A96C6@python.org>
Message-ID: <ed98nf$m3b$1@sea.gmane.org>

Barry Warsaw wrote:

> I recently emerged in several packages.

good thing dictionary.com includes wikipedia articles, or I'd never have
figured out whether that was a typo or a rather odd spiritual phenomenon.

</F> 




From paul at prescod.net  Fri Sep  1 16:11:35 2006
From: paul at prescod.net (Paul Prescod)
Date: Fri, 1 Sep 2006 07:11:35 -0700
Subject: [Python-3000] Character Set Independence
Message-ID: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com>

I thought that others might find this reference interesting. It is Matz (the
inventor of Ruby) talking about why he thinks that Unicode is good for what
it does but not sufficient in general, along with some hints of what he
plans for multinationalization in Ruby. The translation is rough and is
lifted from this email:

http://rubyforge.org/pipermail/rhg-discussion/2006-April/000136.html

I think that the gist of it is that Unicode will be "just one character set"
supported by Ruby. This idea has been kicked around for Python before but
you quickly run into questions about how you compare character strings from
multiple character sets, to say nothing of the complexity of a
character-encoding- and character-set-agnostic regular expression engine.

I guess Matz is the right guy to experiment with that stuff. Maybe it could
be copied in Python 4K.

What are your complaints towards Unicode?
* it's thoroughly used, isn't it.
* resentment towards Han unification?
* inferiority complex of Japanese people?
--
What are your complaints towards Unicode?
* no, no I do not have any complaints about Unicode
* in the domains where Unicode is adequate
--
Then, why CSI?

In most applications, UCS is enough thanks to Unicode.
However, there are also applications for which this is not the case.
--
Fields for which Unicode is not enough
Big character sets
* Konjaku-Mojikyo (Japanese encoding which includes many more than Unicode)
* TRON code
* GB18030
--
Fields for which Unicode is not fitted
Legacy encodings
* conversion to UCS is useless
* big conversion tables
* round-trip problem
--
If a language chooses the UCS system
* you cannot write non-UCS applications
* you can't handle text that can't be expressed with Unicode
--
If a language chooses the CSI system
* CSI is a superset of UCS
* Unicode just has to be handled in CSI
--
... is what we can say but
* CSI is difficult
* can it really be implemented?
--
That's where comes out Japan's traditional arts

Adaptation for the Japanese language of applications
* Modification of English language applications to be able to process Japanese
--
Adaptation for the Japanese language of applications

* What engineers of long ago experienced for sure
  - Emacs (NEmacs)
  - Perl (JPerl)
  - Bash
--
Accumulation of know-how

In Japan, the know-how of adaptation for the Japanese language
(multi-byte text processing)
has been accumulated.
--
Accumulation of know-how

in the first place, just for local use,
text using 3 encodings circulate
(4 if including UTF-8)
--
Based on this know-how
* multibyte text encodings
* switching between encodings at the string level
* processing them at practical speed
is finished
--
Available encodings

euc_tw   euc_jp   iso8859_*  utf-8     utf-32le
ascii    euc_kr   koi8       utf-16le  utf-32be
big5     gb2312   sjis       utf-16be

...and many others
If it's a stateless encodings, in principle it can be available.
--
It means
For applications using only one encoding, code conversion is not needed
--
Moreover
Applications wanting to handle multiple encodings can choose an
internal encoding (generally Unicode) that includes all others
--
If you want to
* you can also handle multiple encodings without conversion, letting
characters as they are
* but this is difficult so I do not recommend it
--
However,
only the basic part is done,
it's far from being ready for practical use
* code conversion
* guessing encoding
* etc.
--
For the time being, today
I want to tell everyone:
* UCS is practical
* but not all-purpose
* CSI is not impossible
--
The reason I'm saying that
They may add CSI in Perl6 as they had added
* Methods called by "."
* Continuations
from Ruby.
Basically, they hate losing.
--
Thank you

From jimjjewett at gmail.com  Fri Sep  1 16:24:42 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 1 Sep 2006 10:24:42 -0400
Subject: [Python-3000] Exception Expressions
In-Reply-To: <bbaeab100608311518g29c1b4a5x38834d4f5582e4f1@mail.gmail.com>
References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
	<bbaeab100608311120v67b23b79p15c2d46fe86cbed9@mail.gmail.com>
	<76fd5acf0608311450r6fbddd44n28ab6f83741b8699@mail.gmail.com>
	<bbaeab100608311518g29c1b4a5x38834d4f5582e4f1@mail.gmail.com>
Message-ID: <fb6fbf560609010724n11e3191i5d4a38a81c5a54c3@mail.gmail.com>

On 8/31/06, Brett Cannon <brett at python.org> wrote:
> On 8/31/06, Calvin Spealman <ironfroggy at gmail.com> wrote:
> > On 8/31/06, Brett Cannon <brett at python.org> wrote:
> > > So this feels like the Perl idiom of using die: ``open(file) or die``

> > "Ouch" on the associated my idea with perl!

> =)  The truth hurts.

Isn't this almost the opposite of "or die"?  Unless I'm having a very
bad day, the die idiom is more like a SystemExit, but this proposal is
a way to recover from expected Exceptions.

    func(args) || die(msg)

means

    >>> if not func(args):
    ...     raise SystemExit(msg)

This proposal, with the "a non-dict mapping might not have get" use case:

    >>> ((mymap[k] except KeyError then default) for k in source)

means

    >>> def __temp():
    ...     for k in source:
    ...         try:
    ...             v = mymap[k]
    ...         except KeyError:
    ...             v = default
    ...         yield v
    >>> __temp()

-jJ

From guido at python.org  Fri Sep  1 16:59:47 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Sep 2006 07:59:47 -0700
Subject: [Python-3000] Character Set Independence
In-Reply-To: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com>
References: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com>
Message-ID: <ca471dc20609010759r48e40cb4t50386888eadd62ca@mail.gmail.com>

I think in a sense Python *will* continue to support multiple
character sets -- as byte streams. IMO that's the only reasonable
approach. Unlike Matz, apparently, I've never heard complaints that
Python 2 doesn't have enough support for character sets larger than
Unicode, and that is effectively what it supports: encoded strings and
Unicode strings.

--Guido

On 9/1/06, Paul Prescod <paul at prescod.net> wrote:
> I thought that others might find this reference interesting. It is Matz (the
> inventor of Ruby) talking about why he thinks that Unicode is good for what
> it does but not sufficient in general, along with some hints of what he
> plans for multinationalization in Ruby. The translation is rough and is
> lifted from this email:
>
> http://rubyforge.org/pipermail/rhg-discussion/2006-April/000136.html
>
> I think that the gist of it is that Unicode will be "just one character set"
> supported by Ruby. This idea has been kicked around for Python before but
> you quickly run into questions about how you compare character strings from
> multiple character sets, to say nothing of the complexity of an character
> encoding and character set agnostic regular expression engine.
>
> I guess Matz is the right guy to experiment with that stuff. Maybe it could
> be copied in Python 4K.
> What are your complaints towards Unicode?
> * it's thoroughly used, isn't it.
> * resentment towards Han unification?
>
> * inferiority complex of Japanese people?
> --
> What are your complaints towards Unicode?
> * no, no I do not have any complaints about Unicode
> * in the domains where Unicode is adequate
> --
> Then, why CSI?
>
>
> In most applications, UCS is enough thanks to Unicode.
> However, there are also applications for which this is not the case.
> --
> Fields for which Unicode is not enough
> Big character sets
> * Konjaku-Mojikyo (Japanese encoding which includes many more than Unicode)
>
> * TRON code
> * GB18030
> --
> Fields for which Unicode is not fitted
> Legacy encodings
> * conversion to UCS is useless
> * big conversion tables
> * round-trip problem
> --
> If a language chooses the UCS system
>
> * you cannot write non-UCS applications
> * you can't handle text that can't be expressed with Unicode
> --
> If a language chooses the CSI system
> * CSI is a superset of UCS
> * Unicode just has to be handled in CSI
>
> --
> ... is what we can say but
> * CSI is difficult
> * can it really be implemented?
> --
> That's where comes out Japan's traditional arts
>
> Adaptation for the Japanese language of applications
> * Modification of English language applications to be able to process
> Japanese
>
> --
> Adaptation for the Japanese language of applications
>
> * What engineers of long ago experienced for sure
>  - Emacs (NEmacs)
>  - Perl (JPerl)
>  - Bash
> --
> Accumulation of know-how
>
> In Japan, the know-how of adaptation for the Japanese language
>
> (multi-byte text processing)
> has been accumulated.
> --
> Accumulation of know-how
>
> in the first place, just for local use,
> text using 3 encodings circulate
> (4 if including UTF-8)
> --
> Based on this know-how
>
> * multibyte text encodings
> * switching between encodings at the string level
> * processing them at practical speed
> is finished
> --
> Available encodings
>
> euc_tw euc_jp iso8859_* utf-8 utf-32le
>
> ascii euc_kr koi8 utf-16le utf-32be
> big5 gb2312 sjis utf-16be
>
> ...and many others
> If it's a stateless encoding, in principle it can be made available.
> --
> It means
> For applications using only one encoding, code conversion is not needed
>
> --
> Moreover
> Applications wanting to handle multiple encodings can choose an
> internal encoding (generally Unicode) that includes all others
> --
> If you want to
> * you can also handle multiple encodings without conversion, letting
>
> characters as they are
> * but this is difficult so I do not recommend it
> --
> However,
> only the basic part is done,
> it's far from being ready for practical use
> * code conversion
> * guessing encoding
>
> * etc.
> --
> For the time being, today
> I want to tell everyone:
> * UCS is practical
> * but not all-purpose
> * CSI is not impossible
> --
> The reason I'm saying that
> They may add CSI in Perl6 as they had added
>
> * Methods called by "."
> * Continuations
> from Ruby.
> Basically, they hate losing.
> --
> Thank you
>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From mcherm at mcherm.com  Fri Sep  1 17:03:59 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Fri, 01 Sep 2006 08:03:59 -0700
Subject: [Python-3000] Exception Expressions
Message-ID: <20060901080359.zsxl30h7bpwswc40@login.werra.lunarpages.com>

Calvin Spealman writes:
> I thought I felt in the mood for some abuse today, so I'm proposing
> something sure to give me plenty of crap, but maybe someone will enjoy
> the idea, anyway.
        [...]
>     expr1 except expr2 if exc_type

This is wonderful!

In combination with conditional expressions, list comprehensions, and
lambda, I think this would make it possible to write full-powered Python
programs on a single line. Actually, putting it on a single line in your
text editor would just make things unreadable, but if you wrap
parentheses around it, then the entire program can be a single
expression, something like this:

     for entry in entryList:
         if entry.status() == 'open':
             try:
                 entry.display()
             except DisplayError:
                 entry.setStatus('error')
                 entry.hide()
         else:
             entry.hide()

would become this:

     (   (   (   entry.display()
                 except (
                     entry.setStatus('error'),
                     entry.hide()
                 )
                 if DisplayError
             )
             if entry.status() == 'open'
             else entry.hide()
         )
         for entry in entryList
     )

(Or you *could* choose to compress it as follows:)

     (((entry.display()except(entry.setStatus('error'
     ),entry.hide())if DisplayError)if entry.status()
     =='open' else entry.hide())for entry in entryList)

Now, I wouldn't try to claim that this single-expression
version is *more* readable than the original, but it
has a significant advantage: it makes the language no
longer dependent on significant whitespace for demarking
lines and blocks! There are places where significant
whitespace is a problem, most notably when trying to
embed Python code within other documents.

Just imagine using this new form to embed Python
within HTML to create a new and more powerful form of
dynamic page generation:

    <div class="entrydiv">
       <p class="item-title"><*entry.title()*></p>
       <ul>
          <* "<li>Valid</li>" if entry.isvalid() else "" *>
          <* "<li>Active</li>" if entry.active else "<li>Inactive</li>" *>
       </ul>
       <p class="item-content">
          <* entry.showContent() except "No Data Available" if Exception *>
       </p>
    </div>

Isn't it amazing?

.
.
.

Okay... *everything* above comes with a HUGE wink. It's
a joke. Calvin's idea is clever, and readable once you get
used to conditional expressions, but I'm still a solid -1
on the proposal. But thanks for giving me something fun to
think about.

-- Michael Chermside


From nnorwitz at gmail.com  Fri Sep  1 18:58:49 2006
From: nnorwitz at gmail.com (Neal Norwitz)
Date: Fri, 1 Sep 2006 09:58:49 -0700
Subject: [Python-3000] string C API
In-Reply-To: <ed968c$d03$1@sea.gmane.org>
References: <ed968c$d03$1@sea.gmane.org>
Message-ID: <ee2a432c0609010958x167a61a0w8cb75522d885717c@mail.gmail.com>

On 9/1/06, Fredrik Lundh <fredrik at pythonware.com> wrote:
> just noticed that PEP 3100 says that PyString_AsEncodedString and
> PyString_AsDecodedString are to be removed, but it doesn't mention
> any other PyString (or PyUnicode) functions.
>
> how large changes can we make here, really ?

I don't know if it was the case here or not, but I added a bunch of
APIs to the PEP that were labeled as deprecated or only for backwards
compatibility.  The sources were the doc, header files, and source
files.  (There's no single place to look.)

n

From guido at python.org  Fri Sep  1 19:17:39 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Sep 2006 10:17:39 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ed8pd9$ch$1@sea.gmane.org>
References: <ed8pd9$ch$1@sea.gmane.org>
Message-ID: <ca471dc20609011017k17837255qa8774a335b8d4ed5@mail.gmail.com>

I say not at all.

On 9/1/06, Fredrik Lundh <fredrik at pythonware.com> wrote:
> today's Python supports "locale aware" 8-bit strings; e.g.
>
>     >>> import locale
>     >>> "åäö".isalpha()
>     False
>     >>> locale.setlocale(locale.LC_ALL, "sv_SE")
>     'sv_SE'
>     >>> "åäö".isalpha()
>     True
>
> to what extent should this be supported by Python 3000 ?
>
> </F>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From g.brandl at gmx.net  Fri Sep  1 20:34:09 2006
From: g.brandl at gmx.net (Georg Brandl)
Date: Fri, 01 Sep 2006 20:34:09 +0200
Subject: [Python-3000] Ripping out exec
Message-ID: <ed9uf1$76g$1@sea.gmane.org>

Hi,

in the process of ripping out the exec statement, I stumbled over the
following function in symtable.c (line 468ff):

------------------------------------------------------------------------------------
/* Check for illegal statements in unoptimized namespaces */
static int
check_unoptimized(const PySTEntryObject* ste) {
	char buf[300];
	const char* trailer;

	if (ste->ste_type != FunctionBlock || !ste->ste_unoptimized
	    || !(ste->ste_free || ste->ste_child_free))
		return 1;

	trailer = (ste->ste_child_free ?
		       "contains a nested function with free variables" :
			       "is a nested function");

	switch (ste->ste_unoptimized) {
	case OPT_TOPLEVEL: /* exec / import * at top-level is fine */
	case OPT_EXEC: /* qualified exec is fine */
		return 1;
	case OPT_IMPORT_STAR:
		PyOS_snprintf(buf, sizeof(buf),
			      "import * is not allowed in function '%.100s' "
			      "because it is %s",
			      PyString_AS_STRING(ste->ste_name), trailer);
		break;
	case OPT_BARE_EXEC:
		PyOS_snprintf(buf, sizeof(buf),
			      "unqualified exec is not allowed in function "
			      "'%.100s' it %s",
			      PyString_AS_STRING(ste->ste_name), trailer);
		break;
	default:
		PyOS_snprintf(buf, sizeof(buf),
			      "function '%.100s' uses import * and bare exec, "
			      "which are illegal because it %s",
			      PyString_AS_STRING(ste->ste_name), trailer);
		break;
	}

	PyErr_SetString(PyExc_SyntaxError, buf);
	PyErr_SyntaxLocation(ste->ste_table->st_filename,
			     ste->ste_opt_lineno);
	return 0;
}
--------------------------------------------------------------------------------------

Of course, this check can't be made at compile time if exec() is a function.
(You can even outsmart it currently by giving explicit None arguments to the
exec statement)

So my question is: is this check required, and can it be done at execution time
instead?

Comparing the exec code to execfile(), only this can be the cause for the
extra precaution:
(from Python/ceval.c, function exec_statement)

	if (plain)
		PyFrame_LocalsToFast(f, 0);

Georg


From guido at python.org  Fri Sep  1 20:37:55 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Sep 2006 11:37:55 -0700
Subject: [Python-3000] Ripping out exec
In-Reply-To: <ed9uf1$76g$1@sea.gmane.org>
References: <ed9uf1$76g$1@sea.gmane.org>
Message-ID: <ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>

I would just rip it out.

On 9/1/06, Georg Brandl <g.brandl at gmx.net> wrote:
> Hi,
>
> in the process of ripping out the exec statement, I stumbled over the
> following function in symtable.c (line 468ff):
>
> ------------------------------------------------------------------------------------
> /* Check for illegal statements in unoptimized namespaces */
> static int
> check_unoptimized(const PySTEntryObject* ste) {
>         char buf[300];
>         const char* trailer;
>
>         if (ste->ste_type != FunctionBlock || !ste->ste_unoptimized
>             || !(ste->ste_free || ste->ste_child_free))
>                 return 1;
>
>         trailer = (ste->ste_child_free ?
>                        "contains a nested function with free variables" :
>                                "is a nested function");
>
>         switch (ste->ste_unoptimized) {
>         case OPT_TOPLEVEL: /* exec / import * at top-level is fine */
>         case OPT_EXEC: /* qualified exec is fine */
>                 return 1;
>         case OPT_IMPORT_STAR:
>                 PyOS_snprintf(buf, sizeof(buf),
>                               "import * is not allowed in function '%.100s' "
>                               "because it is %s",
>                               PyString_AS_STRING(ste->ste_name), trailer);
>                 break;
>         case OPT_BARE_EXEC:
>                 PyOS_snprintf(buf, sizeof(buf),
>                               "unqualified exec is not allowed in function "
>                               "'%.100s' because it %s",
>                               PyString_AS_STRING(ste->ste_name), trailer);
>                 break;
>         default:
>                 PyOS_snprintf(buf, sizeof(buf),
>                               "function '%.100s' uses import * and bare exec, "
>                               "which are illegal because it %s",
>                               PyString_AS_STRING(ste->ste_name), trailer);
>                 break;
>         }
>
>         PyErr_SetString(PyExc_SyntaxError, buf);
>         PyErr_SyntaxLocation(ste->ste_table->st_filename,
>                              ste->ste_opt_lineno);
>         return 0;
> }
> --------------------------------------------------------------------------------------
>
> Of course, this check can't be made at compile time if exec() is a function.
> (You can even outsmart it currently by giving explicit None arguments to the
> exec statement)
>
> So my question is: is this check required, and can it be done at execution time
> instead?
>
> Comparing the exec code to execfile(), only this can be the cause for the
> extra precaution:
> (from Python/ceval.c, function exec_statement)
>
>         if (plain)
>                 PyFrame_LocalsToFast(f, 0);
>
> Georg
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jcarlson at uci.edu  Fri Sep  1 21:20:21 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 01 Sep 2006 12:20:21 -0700
Subject: [Python-3000] "string" views
Message-ID: <20060901120313.1B5F.JCARLSON@uci.edu>


Attached you will find a zip file containing the implementation of a
'stringview' object written against Python 2.3 and Pyrex 0.9.3.  I
didn't implement center, decode, encode, ljust, rjust, splitlines, title,
translate, zfill, __[r]mod__, or slicing with a step != 1; my optimization
for view.join(...) doesn't seem to work, and view.split('') is also not
implemented.

I'm stopping for right now because I'm a bit burnt out on this
particular project.  If it seems hacked together, it is because it is
hacked together.  Whenever possible it returns views.  It also will
generally take anything that supports the buffer protocol as an argument
where a string or view would have also made sense.

Please remember that this is just a proof-of-concept implementation; I
would imagine that an actual view object would likely need to be written
in pure C, and though I have tested each method by hand, there may be
bugs.

I have also included the output file "stringview.c" for those without a
working Pyrex installation, which should compile against Python 2.3
headers, and perhaps even 2.4 headers.
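
For anyone who just wants the flavor of it without compiling anything:
the core trick is nothing more than carrying (base, start, stop) around
and deferring the copy.  A rough pure-Python sketch (illustrative only --
this is not the attached implementation):

    class strview(object):
        """Sketch of a non-copying view over a host string."""
        def __init__(self, base, start=0, stop=None):
            self.base = base
            self.start = start
            self.stop = len(base) if stop is None else stop

        def __len__(self):
            return self.stop - self.start

        def __str__(self):
            return self.base[self.start:self.stop]   # the only copy made

        def find(self, sub):
            i = self.base.find(sub, self.start, self.stop)
            return i if i < 0 else i - self.start

        def partition(self, sep):
            i = self.base.find(sep, self.start, self.stop)
            if i < 0:
                return self, strview('', 0, 0), strview('', 0, 0)
            j = i + len(sep)
            return (strview(self.base, self.start, i),
                    strview(self.base, i, j),
                    strview(self.base, j, self.stop))

e.g. strview("key=value").partition("=") slices without copying anything
until you call str() on one of the pieces.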


 - Josiah
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stringview.zip
Type: application/x-zip-compressed
Size: 27685 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20060901/46f8ee58/attachment-0001.bin 

From g.brandl at gmx.net  Fri Sep  1 23:28:15 2006
From: g.brandl at gmx.net (Georg Brandl)
Date: Fri, 01 Sep 2006 23:28:15 +0200
Subject: [Python-3000] Ripping out exec
In-Reply-To: <ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>
References: <ed9uf1$76g$1@sea.gmane.org>
	<ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>
Message-ID: <eda8lg$80n$1@sea.gmane.org>

Guido van Rossum wrote:
> I would just rip it out.

It turns out that it's not so easy. The exec statement currently can
modify the locals, which means that

def f():
     exec "a=1"
     print a

succeeds. To make that possible, the compiler flags scopes containing
exec statements as unoptimized and does not assume unbound names to
be global.

With exec being a function, currently the above function won't work
because "a" is assumed to be global.

I can see only two resolutions:

* change exec() semantics so that it cannot modify the locals
* do not make exec a function

Georg


From guido at python.org  Fri Sep  1 23:57:18 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Sep 2006 14:57:18 -0700
Subject: [Python-3000] Ripping out exec
In-Reply-To: <eda8lg$80n$1@sea.gmane.org>
References: <ed9uf1$76g$1@sea.gmane.org>
	<ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>
	<eda8lg$80n$1@sea.gmane.org>
Message-ID: <ca471dc20609011457x247cee4ch9fb164745e24e33d@mail.gmail.com>

On 9/1/06, Georg Brandl <g.brandl at gmx.net> wrote:
> Guido van Rossum wrote:
> > I would just rip it out.
>
> It turns out that it's not so easy. The exec statement currently can
> modify the locals, which means that
>
> def f():
>      exec "a=1"
>      print a
>
> succeeds. To make that possible, the compiler flags scopes containing
> exec statements as unoptimized and does not assume unbound names to
> be global.
>
> With exec being a function, currently the above function won't work
> because "a" is assumed to be global.
>
> I can see only two resolutions:
>
> * change exec() semantics so that it cannot modify the locals
> * do not make exec a function

Make it so it can't modify the locals. execfile() has the same limitation.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From g.brandl at gmx.net  Sat Sep  2 00:37:20 2006
From: g.brandl at gmx.net (Georg Brandl)
Date: Sat, 02 Sep 2006 00:37:20 +0200
Subject: [Python-3000] Ripping out exec
In-Reply-To: <ca471dc20609011457x247cee4ch9fb164745e24e33d@mail.gmail.com>
References: <ed9uf1$76g$1@sea.gmane.org>	<ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>	<eda8lg$80n$1@sea.gmane.org>
	<ca471dc20609011457x247cee4ch9fb164745e24e33d@mail.gmail.com>
Message-ID: <edacn0$j49$1@sea.gmane.org>

Guido van Rossum wrote:
> On 9/1/06, Georg Brandl <g.brandl at gmx.net> wrote:
>> Guido van Rossum wrote:
>> > I would just rip it out.
>>
>> It turns out that it's not so easy. The exec statement currently can
>> modify the locals, which means that
>>
>> def f():
>>      exec "a=1"
>>      print a
>>
>> succeeds. To make that possible, the compiler flags scopes containing
>> exec statements as unoptimized and does not assume unbound names to
>> be global.
>>
>> With exec being a function, currently the above function won't work
>> because "a" is assumed to be global.
>>
>> I can see only two resolutions:
>>
>> * change exec() semantics so that it cannot modify the locals
>> * do not make exec a function
> 
> Make it so it can't modify the locals. execfile() has the same limitation.
> 

Good. Patch is at python.org/sf/1550800.

There's another one at python.org/sf/1550786 implementing the Ellipsis literal.

cheers,
Georg


From greg.ewing at canterbury.ac.nz  Sat Sep  2 02:10:55 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 02 Sep 2006 12:10:55 +1200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <44F7A557.2010002@acm.org>
References: <20060827184941.1AE8.JCARLSON@uci.edu> <ed1q7r$v4s$2@sea.gmane.org>
	<20060829102307.1B0F.JCARLSON@uci.edu> <ed1uds$iog$1@sea.gmane.org>
	<ed3iq2$9iv$1@sea.gmane.org>
	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>
	<20060831044354.GH6257@performancedrivers.com>
	<44F72E75.2050204@acm.org>
	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
	<44F7A557.2010002@acm.org>
Message-ID: <44F8CC0F.2020004@canterbury.ac.nz>

Talin wrote:

> So for example, any string operation which produces a subset of the 
> string (such as partition, split, index, slice, etc.) will produce a 
> string of the same width as the original string.

It might be possible to represent it in a narrower format,
however. Perhaps there should be an explicit operation for
re-packing a string into the narrowest possible format?
Or should one simply encode it as UTF-8 or something and
then decode it again to get the same effect?

--
Greg

From greg.ewing at canterbury.ac.nz  Sat Sep  2 02:37:02 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 02 Sep 2006 12:37:02 +1200
Subject: [Python-3000] Ripping out exec
In-Reply-To: <ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>
References: <ed9uf1$76g$1@sea.gmane.org>
	<ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>
Message-ID: <44F8D22E.70202@canterbury.ac.nz>

Guido van Rossum wrote:
> I would just rip it out.

I don't understand this business about ripping out
exec. I thought that exec had to be a statement so
the compiler can tell whether to use fast locals.
Do you have a different way of handling that in mind
for Py3k?

--
Greg

From guido at python.org  Sat Sep  2 04:26:46 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 1 Sep 2006 19:26:46 -0700
Subject: [Python-3000] Ripping out exec
In-Reply-To: <44F8D22E.70202@canterbury.ac.nz>
References: <ed9uf1$76g$1@sea.gmane.org>
	<ca471dc20609011137m3b51cb5dr94b216222dc40bb8@mail.gmail.com>
	<44F8D22E.70202@canterbury.ac.nz>
Message-ID: <ca471dc20609011926s44e8a0f8v5e2db668556e7e44@mail.gmail.com>

On 9/1/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Guido van Rossum wrote:
> > I would just rip it out.
>
> I don't understand this business about ripping out
> exec. I thought that exec had to be a statement so
> the compiler can tell whether to use fast locals.
> Do you have a different way of handling that in mind
> for Py3k?

Yes. If we implement the module-level analysis it should be easy
enough to track whether 'exec' refers to the built-in function. (We're
already planning to add some kind of prohibition against outside
modules poking new globals into a module that shadow built-ins.)

But I also see no problem in requiring the use of a dict arg if you want
to observe the side effects of the exec'ed code. So instead of

  def f(s):
    exec s
    print a # presumably s must contain an assignment to a

you'd have to write

  def f(s):
    ns = {}
    exec(s, ns)
    print ns['a']

This makes it a lot clearer what happens IMO.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From ncoghlan at gmail.com  Sat Sep  2 05:42:45 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 02 Sep 2006 13:42:45 +1000
Subject: [Python-3000] Exception Expressions
In-Reply-To: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
References: <76fd5acf0608311042k231fb36w1bf5d1e7e4eebe0c@mail.gmail.com>
Message-ID: <44F8FDB5.6000808@gmail.com>

An interesting idea, although I suspect a leading try keyword would make 
things clearer.

   (try expr1 except expr2 if exc_type)

print (try letters[7] except "N/A" if IndexError)
f = (try open(filename) except open(filename2) if IOError)
print (try eval(expr) except "Can not divide by zero!" if ZeroDivisionError)
val = (try db.get(key) except cache.get(key) if TimeoutError)

This wouldn't help the chaining problem that Greg pointed out, though:

try open(name1) except (try open(name2) except open(name3) if IOError) if IOError

Using a different keyword or a comma so expr2 comes last as Greg suggested 
would fix that:

try open(name1) except IOError, (try open(name2) except IOError, open(name3))

I'd be somewhere between -1 and -0 at this point in time. A review of the
standard library turning up actual use cases that this would make easier
to read might be enough to get me to a +0.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Sat Sep  2 05:47:57 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 02 Sep 2006 13:47:57 +1000
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ed8pd9$ch$1@sea.gmane.org>
References: <ed8pd9$ch$1@sea.gmane.org>
Message-ID: <44F8FEED.9000600@gmail.com>

Fredrik Lundh wrote:
> today's Python supports "locale aware" 8-bit strings; e.g.
> 
>     >>> import locale
> >     >>> "åäö".isalpha()
>     False
>     >>> locale.setlocale(locale.LC_ALL, "sv_SE")
>     'sv_SE'
> >     >>> "åäö".isalpha()
>     True
> 
> to what extent should this be supported by Python 3000 ?

Since all strings will be Unicode by then:

 >>> u"åäö".isalpha()
True

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From qrczak at knm.org.pl  Sat Sep  2 09:57:11 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 02 Sep 2006 09:57:11 +0200
Subject: [Python-3000] Making more effective use of slice objects in Py3k
In-Reply-To: <44F8CC0F.2020004@canterbury.ac.nz> (Greg Ewing's message of
	"Sat, 02 Sep 2006 12:10:55 +1200")
References: <20060827184941.1AE8.JCARLSON@uci.edu>
	<ed1q7r$v4s$2@sea.gmane.org> <20060829102307.1B0F.JCARLSON@uci.edu>
	<ed1uds$iog$1@sea.gmane.org> <ed3iq2$9iv$1@sea.gmane.org>
	<6a36e7290608302056v4b0e68abrfe0c5b1fc927ff@mail.gmail.com>
	<20060831044354.GH6257@performancedrivers.com>
	<44F72E75.2050204@acm.org>
	<ca471dc20608311155i89d671dtdf99907674cbf87d@mail.gmail.com>
	<44F7A557.2010002@acm.org> <44F8CC0F.2020004@canterbury.ac.nz>
Message-ID: <87mz9izoo8.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

> It might be possible to represent it in a narrower format,
> however. Perhaps there should be an explicit operation for
> re-packing a string into the narrowest possible format?

I suppose it's better to always normalize a polymorphic string
representation. And always normalize bignums to fixnums (long->int).

It increases chances of using the more compact representation.
It doesn't add any asymptotic cost, it's done when the whole
object is to be allocated anyway (these are immutable objects).
It simplifies equality comparison.

The narrow formats should be statistically more common than wide
formats anyway.

Programmers should not be expected to care about explicitly calling
a normalization function.
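
The check itself is trivial and O(n) -- a sketch, with the width names
purely illustrative:

    def narrowest_width(codepoints):
        # one pass over the data, done while the immutable string
        # object is being allocated anyway
        m = max(codepoints) if codepoints else 0
        if m <= 0xFF:
            return 1    # fits in Latin-1: one byte per character
        if m <= 0xFFFF:
            return 2    # fits in the BMP: two bytes per character
        return 4        # needs full UCS-4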

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From tomerfiliba at gmail.com  Sat Sep  2 17:53:59 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Sat, 2 Sep 2006 17:53:59 +0200
Subject: [Python-3000] encoding hell
Message-ID: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>

i'm quite finished with the base of iostack (streams and layers), and
have moved to implementing the adapters layer (especially the dreaded
TextAdapter).

as was discussed earlier, streams and layers work with bytes, while
adapters may work with arbitrary objects (be it struct-style records,
serialized objects, characters and whatnot).

the question that arises is -- how far should we stretch this abstraction?
for example, the TextAdapter reads and writes characters to the
stream, after they go through encoding or decoding, so from the programmer's
point of view, he's working with *characters*, not *bytes*.
that means the programmer need not be aware of how the characters
are "physically" stored in the underlying stream.

that's all very nice, but what do we do when it comes to seek()ing?
do you want to seek by character position or by byte position?
logically you are working with characters, but it would be impossible
to implement without first decoding the entire stream in-memory...
which is unacceptable of course.

and if seek()ing is byte-oriented, then you must somehow seek
only to the beginning of a multibyte character sequence... how
would you do that?

my solution would be completely leaving seek() and tell() out of the
3rd layer -- it's a byte-level operation.

anyone thinks differently? if so, what's your solution?

- - - -

you can find the latest sources here (note: i haven't tested it yet,
many things are likely to be broken, it's still being redesigned):
http://sebulbasvn.googlecode.com/svn/trunk/iostack/
http://sebulbasvn.googlecode.com/svn/trunk/sock2/


-tomer

From g.brandl at gmx.net  Sat Sep  2 18:36:37 2006
From: g.brandl at gmx.net (Georg Brandl)
Date: Sat, 02 Sep 2006 18:36:37 +0200
Subject: [Python-3000] The future of exceptions
Message-ID: <edcbqh$63c$1@sea.gmane.org>

While looking at the changes necessary to implement the exception
related syntax changes (except ... as ..., raise without type),
I came across some more substantial things that I think must be discussed.

* How should exceptions be represented in C code? Should there still
  be a (type, value, traceback) triple?

* Could the traceback be made an attribute of the exception?

* What about exception chaining?

Something like this comes to mind::

    try:
        whatever
    except ValueError as err:
        raise CustomException("Something went wrong", prev=err)

With tracebacks becoming part of the exception, that could be::

    raise CustomException(*args, prev=err, tb=traceback)

(`prev` and `tb` would be keyword-only arguments)

With that, all exception info would be contained in one object,
so sys.exc_info() could be renamed to sys.last_exc().
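
As a rough sketch of the resulting data model (pure Python, with `prev`
and `tb` as above; keyword-only arguments are emulated with **kwds, and
nothing here is meant as the final API)::

    class Exception3k(Exception):
        def __init__(self, *args, **kwds):
            Exception.__init__(self, *args)
            self.prev = kwds.pop('prev', None)   # chained predecessor
            self.tb = kwds.pop('tb', None)       # owned traceback

    def format_chain(exc):
        # walk the prev links, printing the oldest exception first
        chain = []
        while exc is not None:
            chain.append(exc)
            exc = getattr(exc, 'prev', None)
        return '\n'.join(repr(e) for e in reversed(chain))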

cheers,
Georg


From qrczak at knm.org.pl  Sat Sep  2 20:04:08 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 02 Sep 2006 20:04:08 +0200
Subject: [Python-3000] The future of exceptions
In-Reply-To: <edcbqh$63c$1@sea.gmane.org> (Georg Brandl's message of "Sat,
	02 Sep 2006 18:36:37 +0200")
References: <edcbqh$63c$1@sea.gmane.org>
Message-ID: <87pseew3fr.fsf@qrnik.zagroda>

Georg Brandl <g.brandl at gmx.net> writes:

> * Could the traceback be made an attribute of the exception?
>
> * What about exception chaining?
>
> Something like this comes to mind::
>
>     try:
>         whatever
>     except ValueError as err:
>         raise CustomException("Something went wrong", prev=err)

In my language the traceback is materialized from the stack only
if needed (typically when an exception escapes from the toplevel),
and it includes the history of other exceptions thrown from exception
handlers, intermingled with source locations. The stack is not
physically unwound until an exception handler completes successfully,
so the data is available until then.

For example the above (without storing prev) would include:
- locations of active functions leading to whatever
- the location of whatever when the value error is raised
- exception: the ValueError instance
- the location of raise CustomException
- exception: the CustomException instance

Printing the stack trace recognizes when the same exception object is
reraised again, and prints this as a propagation instead of repeating
the exception description.

Of course this design is suitable only if the previous exception
is used merely for printing the stack trace, not for unpacking and
examining by the program.

I don't know how Python stack traces are implemented, so I have no
idea whether this would be practical for Python, assuming it would be
desirable at all.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From talin at acm.org  Sat Sep  2 22:23:32 2006
From: talin at acm.org (Talin)
Date: Sat, 02 Sep 2006 13:23:32 -0700
Subject: [Python-3000] encoding hell
In-Reply-To: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
Message-ID: <44F9E844.2020603@acm.org>

tomer filiba wrote:
> i'm quite finished with the base of iostack (streams and layers), and
> have moved to implementing the adapters layer (especially the dreaded
> TextAdapter).
> 
> as was discussed earlier, streams and layers work with bytes, while
> adapters may work with arbitrary objects (be it struct-style records,
> serialized objects, characters and whatnot).
> 
> the question that arises is -- how far should we stretch this abstraction?
> for example, the TextAdapter reads and writes characters to the
> stream, after they go through encoding or decoding, so from the programmer's
> point of view, he's working with *characters*, not *bytes*.
> that means the programmer need not be aware of how the characters
> are "physically" stored in the underlying stream.
> 
> that's all very nice, but what do we do when it comes to seek()ing?
> do you want to seek by character position or by byte position?
> logically you are working with characters, but it would be impossible
> to implement without first decoding the entire stream in-memory...
> which is unacceptable of course.
> 
> and if seek()ing is byte-oriented, then you must somehow seek
> only to the beginning of a multibyte character sequence... how
> would you do that?
> 
> my solution would be completely leaving seek() and tell() out of the
> 3rd layer -- it's a byte-level operation.
> 
> anyone thinks differently? if so, what's your solution?

Well, for comparison with other APIs:

The .Net equivalent, System.IO.TextReader, does not have a "seek" method 
at all.

The Java version, Java.io.BufferedReader, has a "skip()" method which 
only allows seeking forward.

Sounds to me like copying the Java model would work.

-- Talin

From brett at python.org  Sat Sep  2 22:44:00 2006
From: brett at python.org (Brett Cannon)
Date: Sat, 2 Sep 2006 13:44:00 -0700
Subject: [Python-3000] The future of exceptions
In-Reply-To: <edcbqh$63c$1@sea.gmane.org>
References: <edcbqh$63c$1@sea.gmane.org>
Message-ID: <bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>

On 9/2/06, Georg Brandl <g.brandl at gmx.net> wrote:
>
> While looking at the changes necessary to implement the exception
> related syntax changes (except ... as ..., raise without type),
> I came across some more substantial things that I think must be discussed.


You have read Ping's PEP 344, right?

> * How should exceptions be represented in C code? Should there still
>   be a (type, value, traceback) triple?
>
> * Could the traceback be made an attribute of the exception?


The problem with this is that it keeps the frame alive.  This is why this
and exception chaining were considered a design issue in Ping's PEP since
that is a lot of stuff to keep alive.

> * What about exception chaining?
>
> Something like this comes to mind::
>
>     try:
>         whatever
>     except ValueError as err:
>         raise CustomException("Something went wrong", prev=err)
>
> With tracebacks becoming part of the exception, that could be::
>
>     raise CustomException(*args, prev=err, tb=traceback)
>
> (`prev` and `tb` would be keyword-only arguments)
>
> With that, all exception info would be contained in one object,
> so sys.exc_info() could be renamed to sys.last_exc().


Right, which is why the original suggestion came up in the first place.  It
would be nice to compartmentalize exceptions entirely, but the worry of
keeping a ton of memory alive for it needs to be addressed, especially if
exceptions are to be kept lightweight and usable for things other than
flagging errors.

-Brett

From tomerfiliba at gmail.com  Sun Sep  3 00:29:25 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Sun, 3 Sep 2006 00:29:25 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <44F9E844.2020603@acm.org>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44F9E844.2020603@acm.org>
Message-ID: <1d85506f0609021529o3a83dccbod0a7a643d39da696@mail.gmail.com>

[Talin]
> The Java version, Java.io.BufferedReader, has a "skip()" method which
> only allows seeking forward.
> Sounds to me like copying the Java model would work.

then there's no need for it at all... just read() and discard the return value.
we don't need a special API for that.

on the other hand, the .NET version has a BaseStream attribute holding
the underlying stream over which the StreamReader operates... this
means you *can* change the position if the underlying stream supports
seeking.

i read through the msdn but found no explicit definition of what happens
when seeking in text-encoded streams; they noted somewhere that they
use a "best fit" decoder, which, to the best of my understanding, may
skip some bytes until it's in sync with the stream.

that's a *horrible* design, imho, but that's microsoft. i say let's leave it
below layer 3, at the byte level. if users find seeking very important,
we can come up with a layer-2 ReSyncLayer, which will attempt to
come back into sync with a specified encoding.

for example:

f = TextAdapter(
    ReSyncLayer(
        BufferedLayer(
            FileStream("blah", "r")
        ),
        encoding = "utf8"
    ),
    encoding = "utf8"
)

# read 3 UTF8 *characters*
f.read(3)

# this will seek by AT LEAST 7 *bytes*, until resynched
f.substream.seekby(7)

# we can resume reading of UTF8 *characters*
f.read(3)

heck, i even like this idea :)
thanks for the pointers.
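
p.s. -- resynching is especially cheap for utf8, since continuation
bytes are self-marking. a sketch of the core of such a ReSyncLayer
(assuming the unread() the buffering layer already provides):

    def resync_utf8(stream):
        # skip continuation bytes (0b10xxxxxx) until the next byte
        # starts a character; sketch only
        while True:
            b = stream.read(1)
            if not b:
                return                   # hit EOF while resynching
            if ord(b) & 0xC0 != 0x80:
                stream.unread(b)         # lead byte: hand it back
                return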


-tomer

On 9/2/06, Talin <talin at acm.org> wrote:
> tomer filiba wrote:
> > i'm quite finished with the base of iostack (streams and layers), and
> > have moved to implementing the adapters layer (especially the dreaded
> > TextAdapter).
> >
> > as was discussed earlier, streams and layers work with bytes, while
> > adapters may work with arbitrary objects (be it struct-style records,
> > serialized objects, characters and whatnot).
> >
> > the question that arises is -- how far should we stretch this abstraction?
> > for example, the TextAdapter reads and writes characters to the
> > stream, after they go through encoding or decoding, so from the programmer's
> > point of view, he's working with *characters*, not *bytes*.
> > that means the programmer need not be aware of how the characters
> > are "physically" stored in the underlying stream.
> >
> > that's all very nice, but what do we do when it comes to seek()ing?
> > do you want to seek by character position or by byte position?
> > logically you are working with characters, but it would be impossible
> > to implement without first decoding the entire stream in-memory...
> > which is unacceptable of course.
> >
> > and if seek()ing is byte-oriented, then you must somehow seek
> > only to the beginning of a multibyte character sequence... how
> > would you do that?
> >
> > my solution would be completely leaving seek() and tell() out of the
> > 3rd layer -- it's a byte-level operation.
> >
> > anyone thinks differently? if so, what's your solution?
>
> Well, for comparison with other APIs:
>
> The .Net equivalent, System.IO.TextReader, does not have a "seek" method
> at all.
>
> The Java version, Java.io.BufferedReader, has a "skip()" method which
> only allows seeking forward.
>
> Sounds to me like copying the Java model would work.
>
> -- Talin
>

From greg.ewing at canterbury.ac.nz  Sun Sep  3 01:06:01 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sun, 03 Sep 2006 11:06:01 +1200
Subject: [Python-3000] encoding hell
In-Reply-To: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
Message-ID: <44FA0E59.9010302@canterbury.ac.nz>

tomer filiba wrote:

> my solution would be completely leaving seek() and tell() out of the
> 3rd layer -- it's a byte-level operation.

That's what I'd recommend, too. Seeking doesn't make
sense when the underlying units aren't fixed-length.

The best you could do would be to return some kind
of opaque object from tell() that could be passed
back to seek(). But I'm far from convinced that
would be worth the trouble.
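
Concretely, tell() would have to return something like this
(a sketch; the fields are illustrative):

    class Position(object):
        # a bookmark only tell() can create and only seek() can use
        def __init__(self, byte_offset, decoder_state):
            self._byte_offset = byte_offset      # where the bytes resume
            self._decoder_state = decoder_state  # pending multibyte state

    # pos = stream.tell()   ->  a Position
    # stream.seek(pos)      ->  restores offset *and* decoder state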

--
Greg


From ironfroggy at gmail.com  Sun Sep  3 02:24:22 2006
From: ironfroggy at gmail.com (Calvin Spealman)
Date: Sat, 2 Sep 2006 20:24:22 -0400
Subject: [Python-3000] The future of exceptions
In-Reply-To: <bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>
References: <edcbqh$63c$1@sea.gmane.org>
	<bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>
Message-ID: <76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com>

On 9/2/06, Brett Cannon <brett at python.org> wrote:
> Right, which is why the original suggestion came up in the first place.  It
> would be nice to compartmentalize exceptions entirely, but the worry of
> keeping a ton of memory alive for it needs to be addressed, especially if
> exceptions are to be kept lightweight and usable for things other than
> flagging errors.
>
> -Brett

So, at issue is that attaching tracebacks to exceptions keeps too much
alive and thus makes exceptions too heavy? If the traceback were passed
to the exception constructor and then held as an attribute of the
exception, any exception meant for "light" work (i.e., not normal error
flagging) could simply decide not to include the traceback, and so it
would be destroyed, removing the weight from the exception. Similarly,
tracebacks could have some lean() method to drop references to the
frames.

From brett at python.org  Sun Sep  3 03:34:47 2006
From: brett at python.org (Brett Cannon)
Date: Sat, 2 Sep 2006 18:34:47 -0700
Subject: [Python-3000] The future of exceptions
In-Reply-To: <76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com>
References: <edcbqh$63c$1@sea.gmane.org>
	<bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>
	<76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com>
Message-ID: <bbaeab100609021834x17e818e4u3b68c15f1b4bd776@mail.gmail.com>

On 9/2/06, Calvin Spealman <ironfroggy at gmail.com> wrote:
>
> On 9/2/06, Brett Cannon <brett at python.org> wrote:
> > Right, which is why the original suggestion came up in the first place.  It
> > would be nice to compartmentalize exceptions entirely, but the worry of
> > keeping a ton of memory alive for it needs to be addressed, especially if
> > exceptions are to be kept lightweight and usable for things other than
> > flagging errors.
> >
> > -Brett
>
> So, at issue is that attaching tracebacks to exceptions keeps too much
> alive and thus makes exceptions too heavy?


Basically.  Memory usage goes up if you do this as it stands now.

> If the traceback were passed
> to the exception constructor and then held as an attribute of the
> exception, any exception meant for "light" work (i.e., not normal error
> flagging) could simply decide not to include the traceback, and so it
> would be destroyed, removing the weight from the exception. Similarly,
> tracebacks could have some lean() method to drop references to the
> frames.
>


Problem with that is you then lose any API guarantees of the traceback being
there, which would mean you would still need to keep around sys.exc_info().

-Brett

From fredrik at pythonware.com  Sun Sep  3 11:19:06 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Sun, 03 Sep 2006 11:19:06 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <44FA0E59.9010302@canterbury.ac.nz>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FA0E59.9010302@canterbury.ac.nz>
Message-ID: <ede6m9$c9g$1@sea.gmane.org>

Greg Ewing wrote:

> The best you could do would be to return some kind
> of opaque object from tell() that could be passed
> back to seek().

that's how seek/tell works on text files in today's Python, of course. 
if you're writing portable code, you can only seek to the beginning or 
end of the file, or to a position returned to you by tell.
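
i.e. the portable subset is just:

    pos = f.tell()    # opaque cookie; only good for passing back
    f.seek(0)         # portable: beginning of file
    f.seek(0, 2)      # portable: end of file
    f.seek(pos)       # portable: a position tell() gave you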

</F>


From 2006 at jmunch.dk  Sun Sep  3 19:11:27 2006
From: 2006 at jmunch.dk (Anders J. Munch)
Date: Sun, 03 Sep 2006 19:11:27 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
Message-ID: <44FB0CBF.7070102@jmunch.dk>

tomer filiba wrote:
 > my solution would be completely leaving seek() and tell() out of the
 > 3rd layer -- it's a byte-level operation.
 >
 > anyone thinks differently? if so, what's your solution?

seek and tell are a poor man's sequence.  I would have nothing by those
names.

I would have input streams, output streams and sequences, and I
wouldn't mix the three.  FileReader would be an InputStream,
FileWriter would be an OutputStream.  FileBytes would support the
sequence protocol, mimicking bytes objects.  It would support
random-access read and write using __getitem__ and __setitem__,
allowing slice assignment for slices of equal size.  And there would
be append() to extend the file, and partial __delitem__ support for
truncating.
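
A sketch of the surface I have in mind (positioned I/O underneath;
slices and error handling omitted):

    import os

    class FileBytes(object):
        # sketch: a file presented as a mutable byte sequence
        def __init__(self, path):
            self.fd = os.open(path, os.O_RDWR | os.O_CREAT)

        def __len__(self):
            return os.fstat(self.fd).st_size

        def __getitem__(self, i):
            os.lseek(self.fd, i, 0)        # 0 == SEEK_SET
            return os.read(self.fd, 1)

        def __setitem__(self, i, byte):
            os.lseek(self.fd, i, 0)
            os.write(self.fd, byte)

        def append(self, data):
            os.lseek(self.fd, 0, 2)        # 2 == SEEK_END
            os.write(self.fd, data)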

Looking at your iostack2 Stream class, no sooner do you introduce the
key methods read and write, than you supplement them with capability
queries readable and writable that check whether these methods may
even be called.  IMO this is a clear indication that these methods
really want to be refactored into separate classes.

I think you'll find that separating input, output and random access
into three separate ADTs will much simplify BufferingLayer (even
though you'll need three of them).  At least if you intend to take
interactions between reads and writes into account.

regards,
Anders


From tomerfiliba at gmail.com  Sun Sep  3 20:17:39 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Sun, 3 Sep 2006 20:17:39 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <44FB0CBF.7070102@jmunch.dk>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
Message-ID: <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>

> FileReader would be an InputStream,
> FileWriter would be an OutputStream

yes, this has been discussed, but that's too java-ish by nature.
besides, how would this model handle a simple operation, such as
file("foo", "w+") ?

opening TWO file descriptors for that purpose, one for reading and
another for writing, is a complete waste of resources: handles are not
cheap. not to mention that opening the same file multiple times may
run you into platform-specific pits, like read-after-write bugs, etc.

so the obvious solution is having an underlying "file-like object",
which is basically like today's file (supports read() AND write()),
over which InputStream and OutputStream just expose different
views:

f = file(...)
fr = FileReader(f)
fw = FileWriter(f)
fr.read()
fw.write()

now, this means you start with a "capable" object like file, with all of
the desired operations, and you intentionally CRIPPLE it down into
separate reading and writing front-ends.

so what sense does that make? if you want an InputStream, just be
sure you only call read() or readall(); if you want an OutputStream,
limit yourself to calling write(). input-only/output-only streams are
just silly and artificial overhead -- we don't need them.

the java/.NET world relies on interfaces so much that it might make
sense in that context. but that's not the python way.

> no sooner do you introduce the
> key methods read and write, than you supplement them with capability
> queries readable and writable that check whether these methods may
> even be called. IMO this is a clear indication that these methods
> really want to be refactored into separate classes.

the reason is that some streams, like pipes or partially shutdown()ed
sockets, may be unidirectional; some (e.g., sockets) may not support
seeking -- but the 2nd layer may augment that. for example, the
BufferingLayer may add seeking (it already supports unreading).

that's why streams are queriable -- iostack has a layered structure
that allows each layer to add more functionality to the underlying
layer. in other words, all stream are NOT born equal, but they can
be made equal later :)

that way, when your function accepts a stream as an argument,
it would just check s.readable or s.seekable, without regard to the
*type* of s itself, or the underlying storage --

it may be a file, it may be a buffered socket, but as long as you can
read from it/seek in it,  your code would work just fine. kinda like
duck-typing.
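
e.g. a function that cares only about capabilities, never about
concrete classes (a sketch; readable/writable are the queriable
flags mentioned above):

    def copy_stream(src, dst, chunksize=16384):
        # works for files, buffered sockets, pipes -- anything
        # that says it can read and write
        if not (src.readable and dst.writable):
            raise ValueError("need a readable src and a writable dst")
        while True:
            chunk = src.read(chunksize)
            if not chunk:
                break
            dst.write(chunk)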

> FileBytes would support the
> sequence protocol, mimicking bytes objects.  It would support
> random-access read and write using __getitem__ and __setitem__,
> allowing slice assignment for slices of equal size.

this may be a good direction. i'll try to see how it fits in.


-tomer

On 9/3/06, Anders J. Munch <2006 at jmunch.dk> wrote:
> tomer filiba wrote:
>  > my solution would be completely leaving seek() and tell() out of the
>  > 3rd layer -- it's a byte-level operation.
>  >
>  > anyone thinks differently? if so, what's your solution?
>
> seek and tell are a poor mans sequence.  I would have nothing by those
> names.
>
> I would have input streams, output streams and sequences, and I
> wouldn't mix the three.  FileReader would be an InputStream,
> FileWriter would be an OutputStream.  FileBytes would support the
> sequence protocol, mimicking bytes objects.  It would support
> random-access read and write using __getitem__ and __setitem__,
> allowing slice assignment for slices of equal size.  And there would
> be append() to extend the file, and partial __delitem__ support for
> truncating.
>
> Looking at your iostack2 Stream class, no sooner do you introduce the
> key methods read and write, than you supplement them with capability
> queries readable and writable that check whether these methods may
> even be called.  IMO this is a clear indication that these methods
> really want to be refactored into separate classes.
>
> I think you'll find that separating input, output and random access
> into three separate ADTs will much simplify BufferingLayer (even
> though you'll need three of them).  At least if you intend to take
> interactions between reads and writes into account.
>
> regards,
> Anders
>
>

From qrczak at knm.org.pl  Sun Sep  3 22:23:23 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sun, 03 Sep 2006 22:23:23 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	(tomer filiba's message of "Sun, 3 Sep 2006 20:17:39 +0200")
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
Message-ID: <87lkp0bsxw.fsf@qrnik.zagroda>

"tomer filiba" <tomerfiliba at gmail.com> writes:

>> FileReader would be an InputStream,
>> FileWriter would be an OutputStream
>
> yes, this has been discussed, but that's too java-ish by nature.
> besides, how would this model handle a simple operation, such as
> file("foo", "w+") ?

What is the rationale for this operation on a text file?

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From aahz at pythoncraft.com  Sun Sep  3 22:45:28 2006
From: aahz at pythoncraft.com (Aahz)
Date: Sun, 3 Sep 2006 13:45:28 -0700
Subject: [Python-3000] encoding hell
In-Reply-To: <87lkp0bsxw.fsf@qrnik.zagroda>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<87lkp0bsxw.fsf@qrnik.zagroda>
Message-ID: <20060903204528.GA3950@panix.com>

On Sun, Sep 03, 2006, Marcin 'Qrczak' Kowalczyk wrote:
> "tomer filiba" <tomerfiliba at gmail.com> writes:
>>
>> file("foo", "w+") ?
> 
> What is the rationale for this operation on a text file?

You want to be able to read the file and write data to it.  That argues
in favor of seek(0) and seek(-1) being the only supported behaviors,
though.
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

I support the RKAB

From 2006 at jmunch.dk  Mon Sep  4 00:29:43 2006
From: 2006 at jmunch.dk (Anders J. Munch)
Date: Mon, 04 Sep 2006 00:29:43 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>	
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
Message-ID: <44FB5757.6070209@jmunch.dk>

tomer filiba wrote:
 >> FileReader would be an InputStream,
 >> FileWriter would be an OutputStream
 >
 > yes, this has been discussed, but that's too java-ish by nature.
 > besides, how would this model handle a simple operation, such as
 > file("foo", "w+") ?

You mean, with the intent of both reading and writing to the file in
the same go?  That's what I meant FileBytes for.  Do you have a
requirement for drop-in compatibility with the current I/O?

In all my programming days I don't believe I've written to and read from
the same file handle even once.  Use cases exist, like if you're
implementing a DBMS, or adding to a zip file in-place, but they're the
exception, and by separating that functionality out in a dedicated
class like FileBytes, you avoid having the complexities of mixed input
and output affect your typical use cases.

 > the reason is some streams, like pipes or partially shutdown()ed-
 > sockets may be unidirectional; some (i.e., sockets) may not support
 > seeking -- but the 2nd layer may augment that. for example, the
 > BufferingLayer may add seeking (it already supports unreading).

Watch out!  There's an essential difference between files and
bidirectional communications channels that you need to take into
account.  For a TCP connection, input and output can be seen as
isolated from one another, with each their own stream position, and
each their own contents.  For read/write files, it's a whole different
ballgame, because stream position and data are shared.

That means you cannot use the same buffering code for both cases.  For
files, whenever you write something, you need to take into account
that that may overlap your read buffer or change read position.  You
should take another look at layer.BufferingLayer with that in mind.
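
The bluntest rule that stays coherent is to let any write discard the
read-ahead.  A sketch (real code would handle partial overlaps less
wastefully):

    class RWBuffer(object):
        def __init__(self, raw):
            self.raw = raw
            self.pending = b''    # read-ahead not yet handed out
            self.pos = 0          # file offset where pending starts

        def read(self, n):
            if not self.pending:
                self.pos = self.raw.tell()
                self.pending = self.raw.read(max(n, 8192))
            data, self.pending = self.pending[:n], self.pending[n:]
            self.pos += len(data)
            return data

        def write(self, data):
            # the write may touch what we read ahead: drop it and
            # move the raw stream back to the logical position first
            self.raw.seek(self.pos)
            self.pending = b''
            self.raw.write(data)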

regards, Anders


From talin at acm.org  Mon Sep  4 01:04:34 2006
From: talin at acm.org (Talin)
Date: Sun, 03 Sep 2006 16:04:34 -0700
Subject: [Python-3000] encoding hell
In-Reply-To: <44FB5757.6070209@jmunch.dk>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>		<44FB0CBF.7070102@jmunch.dk>	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<44FB5757.6070209@jmunch.dk>
Message-ID: <44FB5F82.3070809@acm.org>

Anders J. Munch wrote:

> Watch out!  There's an essential difference between files and
> bidirectional communications channels that you need to take into
> account.  For a TCP connection, input and output can be seen as
> isolated from one another, with each their own stream position, and
> each their own contents.  For read/write files, it's a whole different
> ballgame, because stream position and data are shared.
> 
> That means you cannot use the same buffering code for both cases.  For
> files, whenever you write something, you need to take into account
> that that may overlap your read buffer or change read position.  You
> should take another look at layer.BufferingLayer with that in mind.
> 
> regards, Anders

This is a better explanation of some of the comments I was raising 
earlier: The choice of buffering strategy depends on a number of factors 
related to how the stream is going to be used, as well as the internal 
implementation of the stream. A buffering strategy that works well for a 
socket won't work very well for a DBMS.

When I stated earlier that 'the OS can do a better job of buffering than 
we can', what I meant to say was somewhat broader than that - which is 
that each layer is, in many cases, a better judge of what *kind* of 
buffering it needs than the person assembling the layers.

This doesn't mean that each layer has to implement its own buffering 
algorithm. The common buffering algorithms can be factored out into 
their own objects -- but what I'd suggest is that the choice of buffer 
algorithm not *normally* be exposed to the person constructing the io stack.

Thus, when creating a standard "line reader", instead of having the user 
call:

	fh = TextReader( Buffer( File( ... ) ) )

Instead, let the TextReader choose the kind of buffer it wants and 
supply that part automatically. There are several reasons why I think 
this would work better:

1) You can't simply stick just any buffer object in the middle there and 
expect it to work. Different buffer strategies have different 
interfaces, and trying to meld them all into one uber-interface would 
make for a very complex interface.

2) The TextReader knows perfectly well what kind of buffer it needs. 
Depending on how TextReader is implemented, it might want a serial, 
read-only buffer that allows a limited degree of look-ahead buffering so 
that it can find the line breaks. Or it might want a pair of buffers - 
one decoded, one encoded. There's no way that the user can know what 
kind of buffer to use without knowing the implementation details of 
TextReader.

3) TextReader can be optimized even more if it is allowed to 'peek'
inside the internals of the buffer - something that would not be allowed
if it had to conform to calling the buffer through a standard interface.


More generally, the choice of buffer depends on the usage pattern for 
reading / writing to the file - and that usage pattern is embodied in 
the definition of "TextReader". By creating a "TextReader" object, the 
user is stating their intention to read the file a certain way, in a 
certain order, with certain performance characteristics. The choice of 
buffering derives directly from those usage patterns. So the two go hand 
in hand.
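
To make that concrete, the construction could look like this (a sketch;
the class names are illustrative and the decoding deliberately naive):

    class LookaheadBuffer(object):
        # the limited look-ahead buffer this TextReader happens to want
        def __init__(self, raw, size=8192):
            self.raw, self.size, self.pending = raw, size, b''

        def read(self, n):
            while len(self.pending) < n:
                chunk = self.raw.read(self.size)
                if not chunk:
                    break
                self.pending += chunk
            data, self.pending = self.pending[:n], self.pending[n:]
            return data

    class TextReader(object):
        # the user writes TextReader(raw); the buffer is an
        # implementation detail chosen here, not assembled by hand
        def __init__(self, raw, encoding='ascii'):
            self._buf = LookaheadBuffer(raw)
            self._encoding = encoding

        def read(self, nchars):
            # one byte per character only holds for fixed-width
            # encodings; a real reader would use an incremental decoder
            return self._buf.read(nchars).decode(self._encoding)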

Now, I'm not saying that you can't stick additional layers in-between 
TextReader and FileStream if you want to. An example might be the 
"resync" layer that you mentioned, or a journaling layer that insures 
that all writes are recoverable. I'm merely saying that for the specific 
issue of buffering, I think that the choice of buffer type is 
complicated, and requires knowledge that might not be accessible to the 
person assembling the stack.

-- Talin

From greg.ewing at canterbury.ac.nz  Mon Sep  4 01:04:25 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Mon, 04 Sep 2006 11:04:25 +1200
Subject: [Python-3000] The future of exceptions
In-Reply-To: <bbaeab100609021834x17e818e4u3b68c15f1b4bd776@mail.gmail.com>
References: <edcbqh$63c$1@sea.gmane.org>
	<bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>
	<76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com>
	<bbaeab100609021834x17e818e4u3b68c15f1b4bd776@mail.gmail.com>
Message-ID: <44FB5F79.6060507@canterbury.ac.nz>

Brett Cannon wrote:

> Basically.  Memory usage goes up if you do this as it stands now.

I'm not sure I follow that. The traceback gets created anyway,
so how is it going to use more memory if it's attached to a
throwaway exception instead of kept in a sys variable?

If you keep the exception around, that would keep the
traceback too, but how often are exceptions kept for long
periods after being caught?

--
Greg

From greg.ewing at canterbury.ac.nz  Mon Sep  4 01:11:34 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Mon, 04 Sep 2006 11:11:34 +1200
Subject: [Python-3000] encoding hell
In-Reply-To: <ede6m9$c9g$1@sea.gmane.org>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FA0E59.9010302@canterbury.ac.nz> <ede6m9$c9g$1@sea.gmane.org>
Message-ID: <44FB6126.8030706@canterbury.ac.nz>

Fredrik Lundh wrote:

> that's how seek/tell works on text files in today's Python, of course. 
> if you're writing portable code, you can only seek to the beginning or 
> end of the file, or to a position returned to you by tell.

True, but with arbitrary stacks of stream-transforming
objects the value might need to be even more opaque,
since it might need to encapsulate internal states of
decoders, etc. Could be very messy.
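
For instance -- purely a sketch, with invented names -- tell() on such
a stack might have to return a token like this:

import codecs

class Position:
    """Opaque token: a byte offset plus captured decoder state."""
    def __init__(self, byte_offset, decoder_state):
        self.byte_offset = byte_offset
        self.decoder_state = decoder_state

class DecodingStream:
    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def tell(self):
        # Assumes a decoder that can snapshot its own state;
        # not every codec can do this cheaply.
        return Position(self._raw.tell(), self._decoder.getstate())

    def seek(self, pos):
        self._raw.seek(pos.byte_offset)
        self._decoder.setstate(pos.decoder_state)

And that is with only one transforming layer; with a whole stack of
them the token would have to capture the state of every layer.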

--
Greg

From brett at python.org  Mon Sep  4 01:19:55 2006
From: brett at python.org (Brett Cannon)
Date: Sun, 3 Sep 2006 16:19:55 -0700
Subject: [Python-3000] The future of exceptions
In-Reply-To: <44FB5F79.6060507@canterbury.ac.nz>
References: <edcbqh$63c$1@sea.gmane.org>
	<bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>
	<76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com>
	<bbaeab100609021834x17e818e4u3b68c15f1b4bd776@mail.gmail.com>
	<44FB5F79.6060507@canterbury.ac.nz>
Message-ID: <bbaeab100609031619p48b2ac40tfcf33de5e83e5ea5@mail.gmail.com>

On 9/3/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
>
> Brett Cannon wrote:
>
> > Basically.  Memory usage goes up if you do this as it stands now.
>
> I'm not sure I follow that. The traceback gets created anyway,
> so how is it going to use more memory if it's attached to a
> throwaway exception instead of kept in a sys variable?


It won't.

> If you keep the exception around, that would keep the
> traceback too, but how often are exceptions kept for long
> periods after being caught?


Not very often, but I didn't make this argument to begin with; other people did.
It was a sticking point when the idea was first put forth.  I personally
supported adding the attributes, but people kept pushing against it.

-Brett

From jimjjewett at gmail.com  Mon Sep  4 01:22:18 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sun, 3 Sep 2006 19:22:18 -0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44F8FEED.9000600@gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
Message-ID: <fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>

On 9/1/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Fredrik Lundh wrote:
> > today's Python supports "locale aware" 8-bit strings ...
> > to what extent should this be supported by Python 3000 ?

> Since all strings will be Unicode by then:

>  >>> u"???".isalpha()
> True

Two followup questions, then ...

(1)  To what extent should python support files (including stdin,
stdout) in local (non-unicode) encodings?  (not at all, per-file,
settable global default?)

(2)  To what extent will strings have an opaque (or at least
on-demand) backing store, so that decoding/encoding could be delayed?
(For example, Swedish text could be stored in single-byte characters,
and only converted to standard unicode on the rare occasions when it
met strings in an incompatible encoding.)

-jJ

From jimjjewett at gmail.com  Mon Sep  4 02:57:35 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sun, 3 Sep 2006 20:57:35 -0400
Subject: [Python-3000] The future of exceptions
In-Reply-To: <bbaeab100609031619p48b2ac40tfcf33de5e83e5ea5@mail.gmail.com>
References: <edcbqh$63c$1@sea.gmane.org>
	<bbaeab100609021344t2e35c04w37dcebcd3060dad8@mail.gmail.com>
	<76fd5acf0609021724ha1e0d06s1f362bad5595e820@mail.gmail.com>
	<bbaeab100609021834x17e818e4u3b68c15f1b4bd776@mail.gmail.com>
	<44FB5F79.6060507@canterbury.ac.nz>
	<bbaeab100609031619p48b2ac40tfcf33de5e83e5ea5@mail.gmail.com>
Message-ID: <fb6fbf560609031757i10044085md83a1d397b461fa@mail.gmail.com>

On 9/3/06, Brett Cannon <brett at python.org> wrote:
> On 9/3/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:

> > The traceback gets created anyway, so how
> > is it going to use more memory if it's attached to a
> > throwaway exception instead of kept in a sys variable?

> > ...  how often are exceptions kept for long
> > periods after being caught?

> It was a sticking point when the idea was first put forth.

I think people were really objecting to cyclic garbage in general.
Both the garbage collector and weak references have improved since the
original discussion.

Even today, if a StopIteration() participates in a reference cycle,
then it won't be reclaimed until the next gc run.  I'm not quite sure
which direction should be a weakref, but I think it would be
reasonable for the cycle to get broken when a catching except block
exits without reraising.
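
For the record, the cycle is easy to produce once the traceback hangs
off the exception (the attribute name here is invented):

import sys

def make_cycle():
    try:
        raise StopIteration
    except StopIteration as err:
        # The traceback references this frame, and the frame's local
        # 'err' references the exception -- the cycle in question.
        # Breaking it when the except block exits (as suggested
        # above) would spare the cyclic GC.
        err.traceback = sys.exc_info()[2]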

-jJ

From paul at prescod.net  Mon Sep  4 03:55:20 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 3 Sep 2006 18:55:20 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
Message-ID: <1cb725390609031855r7258a2e9q2ce2877b45075744@mail.gmail.com>

On 9/3/06, Jim Jewett <jimjjewett at gmail.com> wrote:
>
> On 9/1/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> > Fredrik Lundh wrote:
> > > today's Python supports "locale aware" 8-bit strings ...
> > > to what extent should this be supported by Python 3000 ?
>
> > Since all strings will be Unicode by then:
>
> >  >>> u"???".isalpha()
> > True
>
> Two followup questions, then ...
>
> (1)  To what extent should python support files (including stdin,
> stdout) in local (non-unicode) encodings?  (not at all, per-file,
> settable global default?)


I presume that Python's support of these will not change from today's. I
don't think that locale changes file decoding today, nor should it. After
all, files are emailed from place to place all the time.

> (2)  To what extent will strings have an opaque (or at least
> on-demand) backing store, so that decoding/encoding could be delayed?
> (For example, Swedish text could be stored in single-byte characters,
> and only converted to standard unicode on the rare occasions when it
> met strings in an incompatible encoding.)


I don't see this as particularly related to the locale issue either. It is
being discussed in other threads under the name "Polymorphic strings."
Fredrik Lundh said:

"I think just delaying decoding would take us most of the way.  the big
advantage of storage polymorphism is that you can avoid decoding and
encoding (and having to pay for the cycles and bytes needed for that) if
you don't have to."

I believe he is working on a prototype.

 Paul Prescod

From guido at python.org  Mon Sep  4 04:11:02 2006
From: guido at python.org (Guido van Rossum)
Date: Sun, 3 Sep 2006 19:11:02 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
Message-ID: <ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>

On 9/3/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 9/1/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> > Fredrik Lundh wrote:
> > > today's Python supports "locale aware" 8-bit strings ...
> > > to what extent should this be supported by Python 3000 ?
>
> > Since all strings will be Unicode by then:
>
> >  >>> u"???".isalpha()
> > True
>
> Two followup questions, then ...
>
> (1)  To what extent should python support files (including stdin,
> stdout) in local (non-unicode) encodings?  (not at all, per-file,
> settable global default?)

I've always said (can someone find a quote perhaps?) that there ought
to be a sensible default encoding for files (including but not limited
to stdin/out/err), perhaps influenced by personalized settings,
environment variables, the OS, etc.

> (2)  To what extent will strings have an opaque (or at least
> on-demand) backing store, so that decoding/encoding could be delayed?
> (For example, Swedish text could be stored in single-byte characters,
> and only converted to standard unicode on the rare occasions when it
> met strings in an incompatible encoding.)

That seems to be a bit of a leading question. Talin is currently
championing strings with different fixed-width storage, and others
have proposed even more flexible "polymorphic strings". You might want
to learn about the NSString type in Apple's Objective-C.

BTW the term "backing store" is typically used for *disk-based*
storage of large amounts of data -- but (even though your first
question is about files) I don't believe this is what you're referring
to.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From jimjjewett at gmail.com  Mon Sep  4 05:14:18 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sun, 3 Sep 2006 23:14:18 -0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
Message-ID: <fb6fbf560609032014l4119842fs2af78e572d84a431@mail.gmail.com>

On 9/3/06, Guido van Rossum <guido at python.org> wrote:
> On 9/3/06, Jim Jewett <jimjjewett at gmail.com> wrote:

> > (2)  To what extent will strings have an opaque
> > (or at least on-demand) backing store, so that
> > decoding/encoding could be delayed?

> That seems to be a bit of a leading question.

Yes; I (mis-?)read the original question as asking whether non-English
users would still be able to use (faster) 8-bit representations.

> BTW the term "backing store" is typically used for
> *disk-based* storage of large amounts of data --
> but (even though your first question is about files)
> I don't believe this is what you're referring to.

You are correct; I had forgotten that meaning, and was taking my usage
from the CFString (~= NSString) documentation suggested earlier.
There it refers to the underlying (private) real storage, rather than
to a disk.

Today, Python unicode characters are limited to a specific fixed width
at compile time, because C extensions can operate directly on the data
buffer.  If C extensions were required to go through the unicode
methods -- or at least to explicitly request a buffer -- then the
underlying storage could (often) be far more efficient.

This privatization would, however, be a major change to the API.
Smaller and faster localized strings are one of the compensatory
benefits.
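
At the Python level the idea is something like this sketch (the real
proposal is about the C representation; the names are invented):

class PolyString:
    """Keep the raw bytes plus their encoding; decode on demand."""

    def __init__(self, raw, encoding):
        self._raw = raw
        self._encoding = encoding
        self._decoded = None

    def as_unicode(self):
        if self._decoded is None:       # decode lazily, at most once
            self._decoded = self._raw.decode(self._encoding)
        return self._decoded

# Swedish text stays in its single-byte form until it actually meets
# a string in an incompatible encoding:
#     s = PolyString(raw_bytes, "iso-8859-1")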

-jJ

From jack at psynchronous.com  Mon Sep  4 09:21:29 2006
From: jack at psynchronous.com (Jack Diederich)
Date: Mon, 4 Sep 2006 03:21:29 -0400
Subject: [Python-3000] The future of exceptions
In-Reply-To: <edcbqh$63c$1@sea.gmane.org>
References: <edcbqh$63c$1@sea.gmane.org>
Message-ID: <20060904072129.GC5707@performancedrivers.com>

On Sat, Sep 02, 2006 at 06:36:37PM +0200, Georg Brandl wrote:
> While looking at the changes necessary to implement the exception
> related syntax changes (except ... as ..., raise without type),
> I came across some more substantial things that I think must be discussed.
> 
> * How should exceptions be represented in C code? Should there still
>   be a (type, value, traceback) triple?
> 
> * Could the traceback be made an attribute of the exception?
> 
> * What about exception chaining?
> 
The last time this came up everyone's eyes glazed over and the conversation
stopped.  That doesn't mean it isn't worth talking about; it just means that
exceptions are hard and potentially make GC miserable.

> Something like this comes to mind::
> 
>     try:
>         whatever
>     except ValueError as err:
>         raise CustomException("Something went wrong", prev=err)
> 
> With tracebacks becoming part of the exception, that could be::
> 
>     raise CustomException(*args, prev=err, tb=traceback)
> 
> (`prev` and `tb` would be keyword-only arguments)
> 
> With that, all exception info would be contained in one object,
> so sys.exc_info() could be renamed to sys.last_exc().
> 

The current system is awkward if you want to do fancy things with
exceptions and tracebacks.  I've never had to do fancy things with
exceptions and tracebacks so I'm OK with it.  "raise" as a bare word
covers all the cases where I need to catch, inspect, and potentially
reraise the original.  In the above example you are just annotating
and reraising an error so a KISS suggestion might go

try:
  whatever
except ValueError as err:
  err.also_squawk += 'Kilroy was here'
  raise

Where 'also_squawk' would be renamed to something more intuitive and much
more international.

-Jack
 

From phd at mail2.phd.pp.ru  Mon Sep  4 12:24:13 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Mon, 4 Sep 2006 14:24:13 +0400
Subject: [Python-3000] encoding hell
In-Reply-To: <20060903204528.GA3950@panix.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<87lkp0bsxw.fsf@qrnik.zagroda> <20060903204528.GA3950@panix.com>
Message-ID: <20060904102413.GC21049@phd.pp.ru>

On Sun, Sep 03, 2006 at 01:45:28PM -0700, Aahz wrote:
> On Sun, Sep 03, 2006, Marcin 'Qrczak' Kowalczyk wrote:
> > "tomer filiba" <tomerfiliba at gmail.com> writes:
> >>
> >> file("foo", "w+") ?
> > 
> > What is a rationale of this operation for a text file?
> 
> You want to be able to read the file and write data to it.  That argues
> in favor of seek(0) and seek(-1) being the only supported behaviors,
> though.

   Sometimes programs need tell() + seek(). Two examples (very similar,
really).

   Example 1. I have a program, an email robot that receives email(s) and
marks email addresses in a "database" that is actually a text file:

--- email database file ---
 phd at phd.pp.ru
 phd at oper.med.ru
--- / ---

   The program opens the file in "r+" mode, reads it line by line and
stores the position of the first character of every line using tell().
When it needs to mark an email it seek()s to the stored position and
writes a '+' mark so the file looks like

--- email database file ---
+phd at phd.pp.ru
 phd at oper.med.ru
--- / ---


   Example 2. INN (the NNTP daemon) stores (or at least stored when I was
using it) information about newsgroups in a text-file database. It uses
another approach - it stores info using lines of equal length:

--- newsgroups ---
comp.lang.python                          000001234567
comp.lang.python.announce                 000000abcdef
--- / ---

   Probably INN doesn't use tell() - it just calculates the position from the
line length. But a Python program needs tell() and seek() for such a file.
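
For the record, Example 1 boils down to something like this sketch
(the function name is mine):

def mark_address(path, address):
    f = open(path, "r+")
    positions = {}
    while True:
        pos = f.tell()                   # start of the current line
        line = f.readline()
        if not line:
            break
        positions[line[1:].strip()] = pos
    f.seek(positions[address])           # back to the stored position
    f.write("+")                         # overwrite the leading space
    f.close()

# mark_address("database.txt", "someone@example.com")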

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From aahz at pythoncraft.com  Mon Sep  4 15:39:52 2006
From: aahz at pythoncraft.com (Aahz)
Date: Mon, 4 Sep 2006 06:39:52 -0700
Subject: [Python-3000] encoding hell
In-Reply-To: <20060904102413.GC21049@phd.pp.ru>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<87lkp0bsxw.fsf@qrnik.zagroda> <20060903204528.GA3950@panix.com>
	<20060904102413.GC21049@phd.pp.ru>
Message-ID: <20060904133951.GA10810@panix.com>

On Mon, Sep 04, 2006, Oleg Broytmann wrote:
> On Sun, Sep 03, 2006 at 01:45:28PM -0700, Aahz wrote:
>> 
>> You want to be able to read the file and write data to it.  That argues
>> in favor of seek(0) and seek(-1) being the only supported behaviors,
>> though.
> 
>    Sometimes programs need tell() + seek(). Two examples (very similar,
> really).
> 
>    Example 1. I have a program, an email robot that receives email(s) and
> marks email addresses in a "database" that is actually a text file:

[snip examples of file with email addresses and INN control files]

My understanding is that those are in fact binary files that are being
treated as line-oriented "text" files.  I would agree that there needs
to be a way to do line-oriented processing on binary files, but anyone
who attempts to process these as text files is foolish at best.
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

I support the RKAB

From david.nospam.hopwood at blueyonder.co.uk  Mon Sep  4 17:50:51 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Mon, 04 Sep 2006 16:50:51 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org>
	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
Message-ID: <44FC4B5B.9010508@blueyonder.co.uk>

Guido van Rossum wrote:
> On 9/3/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> 
>>Two followup questions, then ...
>>
>>(1)  To what extent should python support files (including stdin,
>>stdout) in local (non-unicode) encodings?  (not at all, per-file,
>>settable global default?)

Per-file, I hope.

> I've always said (can someone find a quote perhaps?) that there ought
> to be a sensible default encoding for files (including but not limited
> to stdin/out/err), perhaps influenced by personalized settings,
> environment variables, the OS, etc.

While it should be possible to find out what the OS believes to be
the current "system" charset (GetCPInfoEx(CP_ACP, ...) on Windows;
LC_CHARSET environment variable on Unix), that does not mean that it
is this charset that Python programs should normally use. When defining
a new text-based file type, it is simpler to define it to be always UTF-8.

>>(2)  To what extent will strings have an opaque (or at least
>>on-demand) backing store, so that decoding/encoding could be delayed?
>>(For example, Swedish text could be stored in single-byte characters,
>>and only converted to standard unicode on the rare occasions when it
>>met strings in an incompatible encoding.)
> 
> That seems to be a bit of a leading question. Talin is currently
> championing strings with different fixed-width storage, and others
> have proposed even more flexible "polymorphic strings". You might want
> to learn about the NSString type in Apple's Objective-C.

Operating on encoded constant strings, and decoding each character on the
fly, works fine when the charset is stateless and each character has a 1-1
correspondence with a Unicode character (i.e. code point). In that case
the program can operate on the string essentially as if it were Unicode.
It still works fine for variable-width charsets (including UTF-8 and
UTF-16); that just means that the program has to avoid assuming that a
position in the string is the same thing as a character count.

For charsets like ISCII and ISO 2022, which are stateful and/or have
a different encoding model to Unicode, I don't believe this approach
would work very well. But it is fine to support this for some charsets
and not others.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From guido at python.org  Mon Sep  4 23:32:12 2006
From: guido at python.org (Guido van Rossum)
Date: Mon, 4 Sep 2006 14:32:12 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FC4B5B.9010508@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
Message-ID: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>

On 9/4/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Guido van Rossum wrote:
> > I've always said (can someone find a quote perhaps?) that there ought
> > to be a sensible default encoding for files (including but not limited
> > to stdin/out/err), perhaps influenced by personalized settings,
> > environment variables, the OS, etc.
>
> While it should be possible to find out what the OS believes to be
> the current "system" charset (GetCPInfoEx(CP_ACP, ...) on Windows;
> LC_CHARSET environment variable on Unix), that does not mean that it
> is this charset that Python programs should normally use. When defining
> a new text-based file type, it is simpler to define it to be always UTF-8.

In this particular case I don't care what's simpler to implement, but
what's most likely to do what the user expects. If on a particular box
most files are encoded in encoding X, and the user did whatever is
necessary to tell the tools that that's their preferred encoding, I
want Python to honor that encoding when opening text files, unless the
program makes other arrangements explicitly (such as specifying an
explicit encoding as a parameter to open()).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rasky at develer.com  Tue Sep  5 15:17:49 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 5 Sep 2006 15:17:49 +0200
Subject: [Python-3000] have zip() raise exception for sequences
	of	different lengths
References: <d11dcfba0608301440u34f00311x714d3c1fe94f699a@mail.gmail.com>	<44F608B6.5010209@ewtllc.com><d11dcfba0608301656kb599177t548e25e098de3c47@mail.gmail.com>
	<44F62745.60006@ewtllc.com>
Message-ID: <017401c6d0ed$af903d10$b803030a@trilan>

Raymond Hettinger wrote:

> It's a PITA because it precludes all of the use cases where the
> inputs ARE intentionally of different length (like when one argument
> supplies an infinite iterator):
>
>    for lineno, ts, line in zip(count(1), timestamp(), sys.stdin):
>        print 'Line %d, Time %s:  %s' % (lineno, ts, line)

which is a much more complicated way of writing:

for lineno, line in enumerate(sys.stdin):
    ts = time.time()
    ...

[assuming your "timestamp()" is what I think it is, never heard of it
before].

I double-checked my own uses of zip() and they seem to follow the trend of
those in the Python stdlib: most of the cases are really programming errors if
the two sequences do not match in length. I reckon the usage of infinite
iterators is generally much less common.
-- 
Giovanni Bajo


From paul at prescod.net  Tue Sep  5 18:08:47 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 5 Sep 2006 09:08:47 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
Message-ID: <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>

On 9/4/06, Guido van Rossum <guido at python.org> wrote:
>
> In this particular case I don't care what's simpler to implement, but
> what's most likely to do what the user expects. If on a particular box
> most files are encoded in encoding X, and the user did whatever is
> necessary to tell the tools that that's their preferred encoding, I
> want Python to honor that encoding when opening text files, unless the
> program makes other arrangements explicitly (such as specifying an
> explicit encoding as a parameter to open()).


It does not strike me as accurate that on a modern computer system, a
Swedish person's computer is full of ISO/Swedish-encoded files and a Chinese
person's computer is full of files in a specific Chinese encoding, etc. Maybe
that was true before the notion of variant encodings became so popular.

But now Europeans are just as likely to use UTF-8 as a national encoding and
Asians each have MANY different encodings to select from (some defined by
Unicode, some national). I doubt you'll frequently guess correctly except in
specialized apps where a user has very explicit control over their file
encodings and doesn't depend on applications to choose. The direction over
the lifetime of Python 3000 will be AWAY from national, local,
locale-predictable encodings and TOWARDS global, standard encodings. Once we
get to a place where Unicode encodings are dominant, a local-encodings
feature will be useless. In the transition period, it will actually be
harmful.

Also, only a portion of the text data on a computer is in "documents" where
the end-user has control over the encoding. There are also  many, many
configuration files, emails, saved web pages, chat logs etc. where the
encoding was selected by someone else with a potentially different
nationality.

I would guess that "most" text files on "most" computers in any particular
locale are in ASCII/utf-8. Japanese people also have hosts files and
.htaccess files and INI files and log files and ... Python can't know
whether it is dealing with one of these files or an end-user document.

Of the subset of documents that actually have their encoding controlled by
the local user's preferences, an increasing portion will be XML, and XML
documents describe their encoding explicitly. It would be wrong to use the
locale to override that.

Beyond all of that: It just seems wrong to me that I could send someone a
bunch of files and a Python program and their results processing them would
be different from mine, despite the fact that we run the same version of
Python on the same operating system.

 Paul Prescod

From jimjjewett at gmail.com  Tue Sep  5 18:35:35 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 5 Sep 2006 12:35:35 -0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
Message-ID: <fb6fbf560609050935j4f6ba898j68ba7a827534d0b9@mail.gmail.com>

On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> On 9/4/06, Guido van Rossum <guido at python.org> wrote:

> > In this particular case I don't care what's simpler to implement, but
> > what's most likely to do what the user expects.

Good.

> But now Europeans are just as likely to use UTF-8 as a national encoding

fine; then that will be the locale.

> and Asians each have MANY different encodings to select from (some defined by
> Unicode, some national).

and the one they typically use will be the locale.

If notepad (or vi/emacs/less/cat) agree on what a text file is, and
Python doesn't, it is Python that will lose.

>The direction over
> the lifetype of Python 3000 will be AWAY from national, local,
> locale-predictable encodings and TOWARDS global, standard encodings.

Ruby is not wedding itself to Unicode precisely because they have seen
the opposite in Japan.  It sounded like the "unicode doesn't quite
work" problem will be permanent, because there are fundamental
differences over which glyphs should be unified when.  It isn't just a
matter of using a larger set; there are glyphs which should be unified
in some contexts but not others.

> Also, only a portion of the text data on a computer is in "documents" where
> the end-user has control over the encoding. There are also  many, many
> configuration files, emails, saved web pages, chat logs etc. where the
> encoding was selected by someone else with a potentially different
> nationality.

Typically, these either list the encoding explicitly, or stick to
something close to ASCII, which is included in most national
encodings.

> Beyond all of that: It just seems wrong to me that I could send someone a
> bunch of files and a Python program and their results processing them would
> be different from mine, despite the fact that we run the same version of
> Python on the same operating system.

So include the charset header.

-jJ

From guido at python.org  Tue Sep  5 18:52:59 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 09:52:59 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
Message-ID: <ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>

On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> Beyond all of that: It just seems wrong to me that I could send someone a
> bunch of files and a Python program and their results processing them would
> be different from mine, despite the fact that we run the same version of
> Python on the same operating system.

And it seems just as wrong if Python doesn't do what the user expects.
If I were a beginning Python user, I'd hate it if I had prepared a
simple data file in vi or notepad and my Python program wouldn't read
it right because Python's idea of encoding differs from my editor's.

Sorry Paul, I appreciate your standards-driven perspective, but in
this area I'd rather build in more flexibility than strictly needed,
than too little. If it turns out that on a particular platform all
files are in UTF-8, making Python *on that platform* always choose
UTF-8 is simple enough. OTOH, if on a particular platform UTF-8 is
*not* the norm, Python should not insist on using it anyway. We can
remove this feature once everybody uses UTF-8. I don't believe we're
there yet, and "it just seems wrong" doesn't count as proof. :-)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From g.brandl at gmx.net  Tue Sep  5 19:03:32 2006
From: g.brandl at gmx.net (Georg Brandl)
Date: Tue, 05 Sep 2006 19:03:32 +0200
Subject: [Python-3000] have zip() raise exception for sequences of
	different lengths
In-Reply-To: <017401c6d0ed$af903d10$b803030a@trilan>
References: <d11dcfba0608301440u34f00311x714d3c1fe94f699a@mail.gmail.com>	<44F608B6.5010209@ewtllc.com><d11dcfba0608301656kb599177t548e25e098de3c47@mail.gmail.com>	<44F62745.60006@ewtllc.com>
	<017401c6d0ed$af903d10$b803030a@trilan>
Message-ID: <edkal4$7b0$1@sea.gmane.org>

Giovanni Bajo wrote:
> Raymond Hettinger wrote:
> 
>> It's a PITA because it precludes all of the use cases where the
>> inputs ARE intentionally of different length (like when one argument
>> supplies an infinite iterator):
>>
>>    for lineno, ts, line in zip(count(1), timestamp(), sys.stdin):
>>        print 'Line %d, Time %s:  %s' % (lineno, ts, line)
> 
> which is a much more complicated way of writing:
> 
> for lineno, line in enumerate(sys.stdin):
>     ts = time.time()
>     ...

enumerate() starts at 0, count(1) at 1, so you'd have to do a

   lineno += 1

in the body too.

Whether

   for lineno, ts, line in zip(count(1), timestamp(), sys.stdin):

is more complicated than

   for lineno, line in enumerate(sys.stdin):
       ts = time.time()
       lineno += 1

is a stylistic question.

(However, enumerate() could grow a second argument specifying the
starting index).
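
A sketch of what that could look like, while enumerate() itself has no
such argument:

def enumerate_from(iterable, start=0):
    n = start
    for item in iterable:
        yield n, item
        n += 1

# for lineno, line in enumerate_from(sys.stdin, 1):
#     ...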

Georg


From brian at sweetapp.com  Tue Sep  5 20:12:15 2006
From: brian at sweetapp.com (Brian Quinlan)
Date: Tue, 05 Sep 2006 20:12:15 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org>
	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
Message-ID: <44FDBDFF.7090505@sweetapp.com>

Guido van Rossum wrote:
> And it seems just as wrong if Python doesn't do what the user expects.
> If I were a beginning Python user, I'd hate it if I had prepared a
> simple data file in vi or notepad and my Python program wouldn't read
> it right because Python's idea of encoding differs from my editor's.

As a user, I don't have any expectations regarding non-ASCII text files.

I'm using a US-English version of Windows XP (very common) and I haven't 
changed the default encoding (very common). Python claims that my system 
encoding is CP437 (from sys.stdin/stdout.encoding). I can assure you 
that most of the documents that I work with are not in CP437 - they are 
a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that 
this is true of many Windows XP (US-English) users. So, for me and users 
like me, Python is going to silently misinterpret my data.

How about using ASCII as the default encoding and raising an exception 
if non-ASCII text is encountered?
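
In today's terms, the behaviour I mean is simply this (a sketch):

data = open("data.txt", "rb").read()
text = data.decode("ascii")   # raises UnicodeDecodeError at the first
                              # non-ASCII byte instead of silently
                              # misinterpreting it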

Cheers,
Brian

From guido at python.org  Tue Sep  5 21:13:46 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 12:13:46 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FDBDFF.7090505@sweetapp.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com>
Message-ID: <ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>

On 9/5/06, Brian Quinlan <brian at sweetapp.com> wrote:
> Guido van Rossum wrote:
> > And it seems just as wrong if Python doesn't do what the user expects.
> > If I were a beginning Python user, I'd hate it if I had prepared a
> > simple data file in vi or notepad and my Python program wouldn't read
> > it right because Python's idea of encoding differs from my editor's.
>
> As a user, I don't have any expectations regarding non-ASCII text files.

What tools do you use to edit or view those files? How do those tools
know the encoding to use?

(Auto-detection from sniffing the data is a perfectly valid answer BTW
-- I see no reason why that couldn't be one option, as long as there's
a way to disable it.)
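
A crude version of such a sniffer -- a sketch that only looks for
Unicode byte-order marks:

import codecs

def sniff_encoding(path, default=None):
    head = open(path, "rb").read(4)
    for bom, name in [(codecs.BOM_UTF8, "utf-8"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")]:
        if head.startswith(bom):
            return name
    return default    # fall back to whatever the policy says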

> I'm using a US-English version of Windows XP (very common) and I haven't
> changed the default encoding (very common). Python claims that my system
> encoding is CP437 (from sys.stdin/stdout.encoding). I can assure you
> that most of the documents that I work with are not in CP437 - they are
> a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that
> this is true of many Windows XP (US-English) users. So, for me and users
> like me, Python is going to silently misinterpret my data.

Not to any greater extent than Notepad or whatever other tool you are using.

> How about using ASCII as the default encoding and raising an exception
> if non-ASCII text is encountered?

That would not be doing what the user wants. We have extensive
experience with defaulting to ASCII in Python 2.x and it's mostly bad.
There should definitely be a way to force ASCII as the default
encoding (if only as a debugging aid), both in the program code and in
the environment; but it shouldn't be the only default. There should
also be a way to force UTF-8 as the default, or ISO-8859-1. But if
CP437 is the default encoding set by the OS I don't see why Python
shouldn't use that as the default *in the absence of any other
preferences*.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From paul at prescod.net  Tue Sep  5 22:17:47 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 5 Sep 2006 13:17:47 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
Message-ID: <1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com>

On 9/5/06, Guido van Rossum <guido at python.org> wrote:
>
> On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> > Beyond all of that: It just seems wrong to me that I could send someone
> a
> > bunch of files and a Python program and their results processing them
> would
> > be different from mine, despite the fact that we run the same version of
> > Python on the same operating system.
>
> And it seems just as wrong if Python doesn't do what the user expects.
> If I were a beginning Python user, I'd hate it if I had prepared a
> simple data file in vi or notepad and my Python program wouldn't read
> it right because Python's idea of encoding differs from my editor's.


My point is that most textual content in the world is NOT produced in vi or
notepad or other applications that read the system encoding. Most content is
produced in Word (future Word files will be zipped Unicode, not opaque
binary), OpenOffice, DreamWeaver, web services, gmail, Thunderbird, phpbb,
etc.

I haven't created locale-relevant content in a generic text editor in a
very, very long time.

Applications like vi and emacs that "help" you to create content that other
people can't consume are not really helping at all. After all, we (now!)
live in a networked era and people don't just create documents and then
print them out on their local printers. Most of the time when I use text
editors I am editing HTML, XML or Python and using the default of CP437 is
wrong for all of those.

Even Python will puke if you take a naive approach to text encodings in
creating a Python program.

sys:1: DeprecationWarning: Non-ASCII character '\xe0' in file
c:\temp\testencoding.py on line 1, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details

Are you going to change the Python interpreter so that it will "just work"
with content created in vi and notepad? Otherwise you're saying that Python
will take a modern collaboration-oriented approach to text processing but
encourage Python programmers to take a naive obsolete approach.

It also isn't just a question of flexibility. I think that Brian Quinlan
made the good point that most English Windows users do not know what
encoding their computer is using. If this represents 25% of the world's
Python users, and these users run into UTF-8 data more often than CP437 then
Python will guess wrong more often than it will guess right for 25% of its
users. This is really dangerous because CP437 will happily read and munge
UTF-8 (or even UCS-2 or binary) data. This makes CP437 a terrible default
for that 25%.

But it's worse than even that. GUI applications on Windows use a different
encoding than command line ones. So on the same box, Python-in-Tk and
Python-on-command line will answer that the system encoding is "cp437"
versus "cp1252". I just tested it.

http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx

Were it not for these issues I would say that it "isn't a big deal" because
modern Linux distributions are moving to UTF-8 default anyhow, and the Mac
seems to use ASCII. So we're moving to international standards regardless.
But default encoding on Windows is totally broken.

The Mac is not totally consistent either. The console decodes UTF-8 for
display. Textedit and vim munge the display in different ways (same GUI
versus command-line issue again, I guess)

A question: what happens when Python is reading data from a socket or other
file-like object? Will that data also be decoded as if it came from the
user's locale?

I don't think that this discussion really has anything to do with being
compatible with "most of the files on a computer". It is about being
compatible with a certain set of Unix text processing applications.

 Paul Prescod

From paul at prescod.net  Tue Sep  5 22:21:25 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 5 Sep 2006 13:21:25 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com>
	<ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>
Message-ID: <1cb725390609051321i518d7b4cm607cbae361a55d7d@mail.gmail.com>

On 9/5/06, Guido van Rossum <guido at python.org> wrote:
>
> > So, for me and users
> > like me, Python is going to silently misinterpret my data.
>
> Not to any greater extent than Notepad or whatever other tool you are
> using.


Yes. Unicode was invented in large part because people got sick of crappy
tools that silently misinterpreted their data. "I see a Euro character here,
a happy face there, a stack trace in a third place and my friend says he
sees an accented character." Not only do we not want to emulate that (PEP
263 explicitly chooses not to), we don't want to encourage other programmers
to do so either.

 Paul Prescod

From guido at python.org  Tue Sep  5 22:48:27 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 13:48:27 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com>
Message-ID: <ca471dc20609051348l2329c611xcd4d30868a9bcd03@mail.gmail.com>

I have no desire to continue this discussion in every detail. I
believe we've both made our point, eloquently enough. The designers of
the I/O library will have to come up with the specific rules for
deciding on the default encoding. The only thing I'm saying is that
hardcoding the default encoding in the language standard (like we did
for str<-->unicode in 2.0) would be a mistake. I'm trusting that
building the more basic facilities (such as being able to pass an
explicit encoding to open()) first will enable us to experiment with
different ways of determining a default encoding. That makes more
sense to me than trying to settle this argument by raising our voices.
(And yes, I am building in the possibility that I'm wrong. But
he-said-she-said won't convince me; only actual usage experience.)

--Guido

On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> On 9/5/06, Guido van Rossum <guido at python.org> wrote:
>
> > On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> > > Beyond all of that: It just seems wrong to me that I could send someone
> a
> > > bunch of files and a Python program and their results processing them
> would
> > > be different from mine, despite the fact that we run the same version of
> > > Python on the same operating system.
> >
> > And it seems just as wrong if Python doesn't do what the user expects.
> > If I were a beginning Python user, I'd hate it if I had prepared a
> > simple data file in vi or notepad and my Python program wouldn't read
> > it right because Python's idea of encoding differs from my editor's.
>
>
> My point is that most textual content in the world is NOT produced in vi or
> notepad or other applications that read the system encoding. Most content is
> produced in Word (future Word files will be zipped Unicode, not opaque
> binary), OpenOffice, DreamWeaver, web services, gmail, Thunderbird, phpbb,
> etc.
>
> I haven't created locale-relevant content in a generic text editor in a
> very, very long time.
>
> Applications like vi and emacs that "help" you to create content that other
> people can't consume are not really helping at all. After all, we (now!)
> live in a networked era and people don't just create documents and then
> print them out on their local printers. Most of the time when I use text
> editors I am editing HTML, XML or Python and using the default of CP437 is
> wrong for all of those.
>
> Even Python will puke if you take a naive approach to text encodings in
> creating a Python program.
>
> sys:1: DeprecationWarning: Non-ASCII character '\xe0' in file
> c:\temp\testencoding.py on line 1, but no encoding declared; see
> http://www.python.org/peps/pep-0263.html for details
>
> Are you going to change the Python interpreter so that it will "just work"
> with content created in vi and notepad? Otherwise you're saying that Python
> will take a modern collaboration-oriented approach to text processing but
> encourage Python programmers to take a naive obsolete approach.
>
> It also isn't just a question of flexibility. I think that Brian Quinlan
> made the good point that most English Windows users do not know what
> encoding their computer is using. If this represents 25% of the world's
> Python users, and these users run into UTF-8 data more often than CP437 then
> Python will guess wrong more often than it will guess right for 25% of its
> users. This is really dangerous because CP437 will happily read and munge
> UTF-8 (or even UCS-2 or binary) data. This makes CP437 a terrible default
> for that 25%.
>
> But it's worse than even that. GUI applications on Windows use a different
> encoding than command line ones. So on the same box, Python-in-Tk and
> Python-on-command line will answer that the system encoding is "cp437"
> versus "cp1252". I just tested it.
>
> http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx
>
> Were it not for these issues I would say that it "isn't a big deal" because
> modern Linux distributions are moving to UTF-8 default anyhow, and the Mac
> seems to use ASCII. So we're moving to international standards regardless.
> But default encoding on Windows is totally broken.
>
> The Mac is not totally consistent either. The console decodes UTF-8 for
> display. Textedit and vim munge the display in different ways (same GUI
> versus command-line issue again, I guess)
>
> A question: what happens when Python is reading data from a socket or other
> file-like object? Will that data also be decoded as if it came from the
> user's locale?
>
> I don't think that this discussion really has anything to do with being
> compatible with "most of the files on a computer". It is about being
> compatible with a certain set of Unix text processing applications.
>
>  Paul Prescod
>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From oliphant.travis at ieee.org  Wed Sep  6 00:17:49 2006
From: oliphant.travis at ieee.org (Travis Oliphant)
Date: Tue, 05 Sep 2006 16:17:49 -0600
Subject: [Python-3000] long/int unification
In-Reply-To: <1156470595.44ee57436b03d@www.domainfactory-webmail.de>
References: <1156470595.44ee57436b03d@www.domainfactory-webmail.de>
Message-ID: <edkt2e$bhd$1@sea.gmane.org>

martin at v.loewis.de wrote:
> Here is a quick status of the int_unification branch,
> summarizing what I did at the Google sprint in NYC.
> 
> - the int type has been dropped; the builtins int and long
>   now both refer to long type
> - all PyInt_* API is forwarded to the PyLong_* API. Little
>   changes to the C code are necessary; the most common offender
>   is PyInt_AS_LONG((PyIntObject*)v) since I completely removed
>   PyIntObject.
> - Much of the test suite passes, although it still has a number
>   of bugs.
> - There are timing tests for allocation and for addition.
>   On allocation, the current implementation is about a factor
>   of 2 slower; the integer addition is about 1.5 times slower;
>   the initial slowdowns was by a factor of 3. The pystones
>   dropped about 10% (pybench fails to run on p3yk).

What impact is this long/int unification going to have on C-based 
sub-types of the old int-type?  Will you be able to sub-class the 
integer-type in C without carrying around all the extra backage of the 
Python long?

NumPy has a scalar-type that inherits from the current int-type which 
allows it to participate in many Python optimizations.  Will the ability 
to do this disappear?

I'm just wondering about the C-side view of the int/long unification.  I 
can see benefit to the notion of integer unification, but wonder if 
strictly throwing out the small integer type on the C-level is actually 
going too far.  In NumPy, we have 10 different integer data-types 
corresponding to what can be contained in an array.  This direction was 
chosen after years of frustration of trying to fit a square peg (the 
item from the NumPy array) into a round hole (the limited Python scalar 
types).

-Travis


From guido at python.org  Wed Sep  6 01:05:22 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 16:05:22 -0700
Subject: [Python-3000] long/int unification
In-Reply-To: <edkt2e$bhd$1@sea.gmane.org>
References: <1156470595.44ee57436b03d@www.domainfactory-webmail.de>
	<edkt2e$bhd$1@sea.gmane.org>
Message-ID: <ca471dc20609051605v21769c5fj54085934051db4af@mail.gmail.com>

On 9/5/06, Travis Oliphant <oliphant.travis at ieee.org> wrote:
> What impact is this long/int unification going to have on C-based
> sub-types of the old int-type?  Will you be able to sub-class the
> integer-type in C without carrying around all the extra baggage of the
> Python long?

This seems unlikely given that the PyInt *type* will go away (though
the PyInt *API* methods may well continue to exist). You can subclass
the PyLong type just as easily. What baggage are you thinking of?

> NumPy has a scalar-type that inherits from the current int-type which
> allows it to participate in many Python optimizations.  Will the ability
> to do this disappear?

What kind of optimizations are you thinking of?

If you're thinking of the current special-casing for e.g. list[int] in
ceval.c, that code will likely disappear (although something
equivalent will eventually be added).

See my message about premature optimization on the Py3k list from about 10 days ago.

> I'm just wondering about the C-side view of the int/long unification.  I
> can see benefit to the notion of integer unification, but wonder if
> strictly throwing out the small integer type on the C-level is actually
> going too far.  In NumPy, we have 10 different integer data-types
> corresponding to what can be contained in an array.  This direction was
> chosen after years of frustration of trying to fit a square peg (the
> item from the NumPy array) into a round hole (the limited Python scalar
> types).

But now that we have __index__, of course, there's less reason to
subclass PyInt in the first place -- you can write your own 32-bit
integer *without* inheriting from PyInt or PyLong, and it should be
usable perfectly whenever an integer is expected. I'd rather make sure
*this* property is provided without compromise than attempting to keep
random older optimizations alive for nostalgia's sake.
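
For the record, such a type is just a few lines (a sketch):

class Int32:
    """A 32-bit integer that inherits from neither PyInt nor PyLong."""

    def __init__(self, value):
        self._value = value & 0xFFFFFFFF   # wrap to 32 bits, unsigned

    def __index__(self):
        return self._value

# "abcdef"[Int32(2)] == "c" -- usable wherever an index is expected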

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Wed Sep  6 01:33:19 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 16:33:19 -0700
Subject: [Python-3000] long/int unification
In-Reply-To: <44FE0752.9020903@ee.byu.edu>
References: <1156470595.44ee57436b03d@www.domainfactory-webmail.de>
	<edkt2e$bhd$1@sea.gmane.org>
	<ca471dc20609051605v21769c5fj54085934051db4af@mail.gmail.com>
	<44FE0752.9020903@ee.byu.edu>
Message-ID: <ca471dc20609051633i51c29ea9qc604f137602fd705@mail.gmail.com>

On 9/5/06, Travis Oliphant <oliphant at ee.byu.edu> wrote:
> Guido van Rossum wrote:
>
> > On 9/5/06, Travis Oliphant <oliphant.travis at ieee.org> wrote:
> >
> >> What impact is this long/int unification going to have on C-based
> >> sub-types of the old int-type?  Will you be able to sub-class the
> >> integer-type in C without carrying around all the extra baggage of the
> >> Python long?
> >
> >
> > This seems unlikely given that the PyInt *type* will go away (though
> > the PyInt *API* methods may well continue to exist). You can subclass
> > the PyLong type just as easily. What baggage are you thinking of?
>
> Just the extra stuff in the C-structure needed to handle the
> arbitrary-length integer.

That's just an int length plus an encoding that stores 15 bits of the actual value in each 16-bit digit.

> > If you're thinking of the current special-casing for e.g. list[int] in
> > ceval.c, that code will likely disappear (although something
> > equivalent will eventually be added).
>
> Yes, that's what I'm thinking of.   It would be nice if the "something
> equivalent" could be extended to other objects.  I suppose the
> discussion can be held off until then.
>
> >
> > But now that we have __index__, of course, there's less reason to
> > subclass PyInt in the first place -- you can write your own 32-bit
> > integer *without* inheriting from PyInt or PyLong, and it should be
> > usable perfectly whenever an integer is expected. Id rather make sure
> > *this* property is provided without compromise than attempting to keep
> > random older optimizations alive for nostalgia's sake.
>
>
> Of course, I agree entirely, so I doubt it will matter at all (except in
> optimizations).  There is probably going to be an increasing need to
> tell whether or not something can handle one of these interfaces.  I
> know this was already discussed on this list, but was a decision reached
> about how to tell if something exposes a specific interface? (I think
> the relevant discussion took place under the name "callable").
>
> I see a lot of
>
> isinstance(obj, int)
>
> in scientific Python code where testing for __index__ would be more
> appropriate.

I wouldn't rip this out just yet. 'int' may become an abstract type
yet -- the int/long unification branch isn't the final word (if only
because it doesn't pass all the unit tests yet).

> Thanks for easing my mind.

You're welcome. And how's that PEP coming? :-)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From oliphant at ee.byu.edu  Wed Sep  6 01:25:06 2006
From: oliphant at ee.byu.edu (Travis Oliphant)
Date: Tue, 05 Sep 2006 17:25:06 -0600
Subject: [Python-3000] long/int unification
In-Reply-To: <ca471dc20609051605v21769c5fj54085934051db4af@mail.gmail.com>
References: <1156470595.44ee57436b03d@www.domainfactory-webmail.de>	
	<edkt2e$bhd$1@sea.gmane.org>
	<ca471dc20609051605v21769c5fj54085934051db4af@mail.gmail.com>
Message-ID: <44FE0752.9020903@ee.byu.edu>

Guido van Rossum wrote:

> On 9/5/06, Travis Oliphant <oliphant.travis at ieee.org> wrote:
>
>> What impact is this long/int unification going to have on C-based
>> sub-types of the old int-type?  Will you be able to sub-class the
>> integer-type in C without carrying around all the extra baggage of the
>> Python long?
>
>
> This seems unlikely given that the PyInt *type* will go away (though
> the PyInt *API* methods may well continue to exist). You can subclass
> the PyLong type just as easily. What baggage are you thinking of?

Just the extra stuff in the C-structure needed to handle the 
arbitrary-length integer.

>
> If you're thinking of the current special-casing for e.g. list[int] in
> ceval.c, that code will likely disappear (although something
> equivalent will eventually be added).

Yes, that's what I'm thinking of.   It would be nice if the "something 
equivalent" could be extended to other objects.  I suppose the 
discussion can be held off until then.

>
> But now that we have __index__, of course, there's less reason to
> subclass PyInt in the first place -- you can write your own 32-bit
> integer *without* inheriting from PyInt or PyLong, and it should be
> perfectly usable whenever an integer is expected. I'd rather make sure
> *this* property is provided without compromise than attempt to keep
> random older optimizations alive for nostalgia's sake.


Of course, I agree entirely, so I doubt it will matter at all (except in 
optimizations).  There is probably going to be an increasing need to 
tell whether or not something can handle one of these interfaces.  I 
know this was already discussed on this list, but was a decision reached 
about how to tell if something exposes a specific interface? (I think 
the relevant discussion took place under the name "callable").  

I see a lot of

isinstance(obj, int)

in scientific Python code where testing for __index__ would be more 
appropriate.

Thanks for easing my mind.

-Travis


From david.nospam.hopwood at blueyonder.co.uk  Wed Sep  6 02:32:28 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Wed, 06 Sep 2006 01:32:28 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>	
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	
	<44FC4B5B.9010508@blueyonder.co.uk>	
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
Message-ID: <44FE171C.1090101@blueyonder.co.uk>

Guido van Rossum wrote:
> On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> 
>> Beyond all of that: It just seems wrong to me that I could send someone a
>> bunch of files and a Python program and their results processing them
>> would be different from mine, despite the fact that we run the same version of
>> Python on the same operating system.
> 
> And it seems just as wrong if Python doesn't do what the user expects.
> If I were a beginning Python user, I'd hate it if I had prepared a
> simple data file in vi or notepad and my Python program wouldn't read
> it right because Python's idea of encoding differs from my editor's.

I don't know about vi, but notepad will open and save files that are not in
the system ("ANSI") encoding just fine. On opening it checks for a BOM and
auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose
"Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the
Encoding drop-down box.

This is exactly the behaviour that most users would expect of a well-behaved
Unicode-aware app. It should be as easy as possible to match this behaviour
in a Python program.
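
A minimal sketch of that BOM sniffing in Python (the cp1252 fallback here
stands in for the system "ANSI" encoding and is only an example):

import codecs

_BOMS = [(codecs.BOM_UTF8, 'utf-8-sig'),     # utf-8-sig strips the BOM
         (codecs.BOM_UTF16_LE, 'utf-16'),    # the utf-16 codec consumes
         (codecs.BOM_UTF16_BE, 'utf-16')]    # the BOM itself

def sniff_encoding(path, fallback='cp1252'):
    # Check the first bytes for a BOM, as notepad does on open.
    head = open(path, 'rb').read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return fallback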

> Sorry Paul, I appreciate your standards-driven perspective, but in
> this area I'd rather build in more flexibility than strictly needed,
> than too little. If it turns out that on a particular platform all
> files are in UTF-8, making Python *on that platform* always choose
> UTF-8 is simple enough.

The problem is not the systems where all files are UTF-8, or all files are
another known charset. The problem is the platforms where half of the files
are UTF-8 and half are in some other charset, determined either by type or by
presence of a UTF-8 BOM. This is a *very* common situation, especially for
European users.

Such a user cannot set the locale to UTF-8, because that will break all of
their non-Unicode-aware applications. The Unicode-aware applications typically
have much better support for reading and writing files in charsets that are
not the system default. So in practice the locale has to be set to the "old"
charset during a migration to UTF-8.

(Setting different locales for different applications is far too much hassle.
On Windows, although I believe it is technically possible to do the equivalent
of selecting a UTF-8 locale, most users don't know how to do it, even if they
want to use UTF-8 exclusively.)

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Wed Sep  6 02:36:10 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Wed, 06 Sep 2006 01:36:10 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FE171C.1090101@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org>
	<44F8FEED.9000600@gmail.com>		<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>		<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>		<44FC4B5B.9010508@blueyonder.co.uk>		<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>		<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
Message-ID: <44FE17FA.6030103@blueyonder.co.uk>

David Hopwood wrote:
> I don't know about vi, but notepad will open and save files that are not in
> the system ("ANSI") encoding just fine. On opening it checks for a BOM and
> auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose
> "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the
> Encoding drop-down box.

... and it also helpfully prompts you to select a Unicode encoding, if you
forget and the file contains characters that are not representable in the ANSI
encoding.

> This is exactly the behaviour that most users would expect of a well-behaved
> Unicode-aware app. It should be as easy as possible to match this behaviour
> in a Python program.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From guido at python.org  Wed Sep  6 02:44:37 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 17:44:37 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FE171C.1090101@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
Message-ID: <ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>

On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Guido van Rossum wrote:
> > On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> >
> >> Beyond all of that: It just seems wrong to me that I could send someone a
> >> bunch of files and a Python program and their results processing them
> >> would be different from mine, despite the fact that we run the same version of
> >> Python on the same operating system.
> >
> > And it seems just as wrong if Python doesn't do what the user expects.
> > If I were a beginning Python user, I'd hate it if I had prepared a
> > simple data file in vi or notepad and my Python program wouldn't read
> > it right because Python's idea of encoding differs from my editor's.
>
> I don't know about vi, but notepad will open and save files that are not in
> the system ("ANSI") encoding just fine. On opening it checks for a BOM and
> auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose
> "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the
> Encoding drop-down box.
>
> This is exactly the behaviour that most users would expect of a well-behaved
> Unicode-aware app. It should be as easy as possible to match this behaviour
> in a Python program.

And this is exactly why I want the determination of the default
encoding (i.e. the encoding to be used when opening a file when no
explicit encoding is specified by the Python code that does the
opening) to be open-ended, rather than picking some standard default
like UTF-8 and saying (like Paul seems to want to say) "this is it".

> > Sorry Paul, I appreciate your standards-driven perspective, but in
> > this area I'd rather build in more flexibility than strictly needed,
> > than too little. If it turns out that on a particular platform all
> > files are in UTF-8, making Python *on that platform* always choose
> > UTF-8 is simple enough.
>
> The problem is not the systems where all files are UTF-8, or all files are
> another known charset. The problem is the platforms where half of the files
> are UTF-8 and half are in some other charset, determined either by type or by
> presence of a UTF-8 BOM. This is a *very* common situation, especially for
> European users.

Right. (And Paul appears to be ignorant of this.)

> Such a user cannot set the locale to UTF-8, because that will break all of
> their non-Unicode-aware applications. The Unicode-aware applications typically
> have much better support for reading and writing files in charsets that are
> not the system default. So in practice the locale has to be set to the "old"
> charset during a migration to UTF-8.
>
> (Setting different locales for different applications is far too much hassle.
> On Windows, although I believe it is technically possible to do the equivalent
> of selecting a UTF-8 locale, most users don't know how to do it, even if they
> want to use UTF-8 exclusively.)

Right. Of course, "locale" and "encoding" are somewhat orthogonal
issues; the encoding may be UTF-8 but that doesn't determine other
aspects of the locale (such as language-specific collation order, or
culture-specific formatting of numbers, dates and money). Now, some
platforms may equate the two somehow, and on those platforms we would
have to inspect the locale to tell the encoding; but other platforms
may specify the encoding separate from the locale...

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From david.nospam.hopwood at blueyonder.co.uk  Wed Sep  6 02:46:29 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Wed, 06 Sep 2006 01:46:29 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org>
	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>	<44FDBDFF.7090505@sweetapp.com>
	<ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>
Message-ID: <44FE1A65.7020900@blueyonder.co.uk>

Guido van Rossum wrote:
> On 9/5/06, Brian Quinlan <brian at sweetapp.com> wrote:
> [...]
> 
> That would not be doing what the user wants. We have extensive
> experience with defaulting to ASCII in Python 2.x and it's mostly bad.
> There should definitely be a way to force ASCII as the default
> encoding (if only as a debugging aid), both in the program code and in
> the environment; but it shouldn't be the only default. There should
> also be a way to force UTF-8 as the default, or ISO-8859-1. But if
> CP436 is the default encoding set by the OS I don't see why Python
> shouldn't use that as the default *in the absence of any other
> preferences*.

Cp436 is almost certainly *not* the encoding set by the OS; Python
has got it wrong. If Brian is using an English-language variant of
Windows XP and has not changed the defaults, the system ("ANSI")
encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1
if C1 control characters are not used).

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From guido at python.org  Wed Sep  6 03:09:21 2006
From: guido at python.org (Guido van Rossum)
Date: Tue, 5 Sep 2006 18:09:21 -0700
Subject: [Python-3000] encoding hell
In-Reply-To: <20060904102413.GC21049@phd.pp.ru>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<87lkp0bsxw.fsf@qrnik.zagroda> <20060903204528.GA3950@panix.com>
	<20060904102413.GC21049@phd.pp.ru>
Message-ID: <ca471dc20609051809iae9cddcxd13b3a7c35c6c4ae@mail.gmail.com>

On 9/4/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
> On Sun, Sep 03, 2006 at 01:45:28PM -0700, Aahz wrote:
> > On Sun, Sep 03, 2006, Marcin 'Qrczak' Kowalczyk wrote:
> > > "tomer filiba" <tomerfiliba at gmail.com> writes:
> > >>
> > >> file("foo", "w+") ?
> > >
> > > What is a rationale of this operation for a text file?
> >
> > You want to be able to read the file and write data to it.  That argues
> > in favor of seek(0) and seek(-1) being the only supported behaviors,
> > though.

Umm, where he wrote seek(-1) he probably meant seek(0, 2) which is how
one seeks to EOF.
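
That is (the filename is a placeholder):

f = open('log.txt', 'r+')
f.seek(0, 2)    # whence=2, i.e. os.SEEK_END: position at end-of-file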

>    Sometimes programs need tell() + seek(). Two examples (very similar,
> really).
>
>    Example 1. I have a program, an email robot that receives email(s) and
> marks email addresses in a "database" that is actually a text file:
>
> --- email database file ---
>  phd at phd.pp.ru
>  phd at oper.med.ru
> --- / ---
>
>    The program opens the file in "r+" mode, reads it line by line and
> stores the position of the first character of every line using tell().
> When it needs to mark an email it seek()'s to the stored position and
> writes a '+' mark so the file looks like
>
> --- email database file ---
> +phd at phd.pp.ru
>  phd at oper.med.ru
> --- / ---

I don't understand how it can insert a character into the file without
rewriting everything after that point.

But it does remind me of a use case for tell+seek on a read-only text
file. An email-reading program may have a text-based multi-message
mailbox format (e.g. UNIX mailbox format) and build an in-memory index
of seek positions using a quick initial scan (or scanning as it goes).
Once it has computed the position of a message it can quickly seek to
its start and display that message.

Granted, typical mailbox formats tend to use ASCII only. But one could
easily imagine a similar use case for encoded text files containing
multiple application-specific sections.

As long as the state of the decoder is "neutral" at the start of a
line, it should be possible to do this. I like the idea that tell()
returns a "cookie" which is really a byte offset. If one wants to be
able to seek to positions with a non-neutral decoder state, the cookie
would have to be more abstract. It shouldn't matter; text apps should
not do arithmetic on seek/tell positions.
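
A rough sketch of that mailbox indexing, treating tell() results as opaque
cookies (the filename is a placeholder):

def index_mbox(f):
    # Record a cookie for the start of each message ('From ' line).
    offsets = []
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        if line.startswith('From '):
            offsets.append(pos)
    return offsets

f = open('mail.mbox', 'r')
starts = index_mbox(f)
f.seek(starts[-1])    # jump straight to the last message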

>    Example 2. INN (the NNTP daemon) stores (at least stored when I was
> using it) information about newsgroups in a text file database. It uses
> another approach - it stores info using lines of equal length:
>
> --- newsgroups ---
> comp.lang.python                          000001234567
> comp.lang.python.announce                 000000abcdef
> --- / ---
>
>    Probably INN doesn't use tell() - it just calculates the position using
> line length. But a Python program needs tell() and seek() for such a file.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From david.nospam.hopwood at blueyonder.co.uk  Wed Sep  6 03:28:31 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Wed, 06 Sep 2006 02:28:31 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>	
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	
	<44FC4B5B.9010508@blueyonder.co.uk>	
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>	
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
Message-ID: <44FE243F.80203@blueyonder.co.uk>

Guido van Rossum wrote:
> On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>> Guido van Rossum wrote:
>> > On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
>> >
>> >> Beyond all of that: It just seems wrong to me that I could send
>> >> someone a bunch of files and a Python program and their results
>> >> processing them would be different from mine, despite the fact that
>> >> we run the same version of Python on the same operating system.
>> >
>> > And it seems just as wrong if Python doesn't do what the user expects.
>> > If I were a beginning Python user, I'd hate it if I had prepared a
>> > simple data file in vi or notepad and my Python program wouldn't read
>> > it right because Python's idea of encoding differs from my editor's.
>>
>> I don't know about vi, but notepad will open and save files that are
>> not in the system ("ANSI") encoding just fine. On opening it checks for
>> a BOM and auto-detects UTF-8 and UTF-16; on saving it will write a BOM
>> if you choose "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or
>> UTF-8 in the Encoding drop-down box.
>>
>> This is exactly the behaviour that most users would expect of a
>> well-behaved Unicode-aware app. It should be as easy as possible to
>> match this behaviour in a Python program.
> 
> And this is exactly why I want the determination of the default
> encoding (i.e. the encoding to be used when opening a file when no
> explicit encoding is specified by the Python code that does the
> opening) to be open-ended, rather than picking some standard default
> like UTF-8 and saying (like Paul seems to want to say) "this is it".

The point I was making is that the system encoding *should not* be
treated as (or called) a "default" encoding. I can't speak for Paul, but
that seemed to also be what he was saying.

The whole idea of a default encoding is flawed. Ideally there would be
no default; programmers should be forced to think about the issue
on a case-by-case basis. In some cases they might choose to open a file
with the system encoding, but that should be an explicit decision.

>> (Setting different locales for different applications is far too much
>> hassle. On Windows, although I believe it is technically possible to
>> do the equivalent of selecting a UTF-8 locale, most users don't know
>> how to do it, even if they want to use UTF-8 exclusively.)
> 
> Right. Of course, "locale" and "encoding" are somewhat orthogonal
> issues; the encoding may be UTF-8 but that doesn't determine other
> aspects of the locale (such as language-specific collation order, or
> culture-specific formatting of numbers, dates and money).

The encoding is usually an attribute of the locale. This is certainly
the case on POSIX and Windows platforms.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From paul at prescod.net  Wed Sep  6 03:53:53 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 5 Sep 2006 18:53:53 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
Message-ID: <1cb725390609051853p59574772q16ca26d17b52c76f@mail.gmail.com>

On 9/5/06, Guido van Rossum <guido at python.org> wrote:
>
> On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> > Guido van Rossum wrote:
> > > On 9/5/06, Paul Prescod <paul at prescod.net> wrote:
> > >
> > >> Beyond all of that: It just seems wrong to me that I could send
> someone a
> > >> bunch of files and a Python program and their results processing them
> > >> would be different from mine, despite the fact that we run the same
> version of
> > >> Python on the same operating system.
> > >
> > > And it seems just as wrong if Python doesn't do what the user expects.
> > > If I were a beginning Python user, I'd hate it if I had prepared a
> > > simple data file in vi or notepad and my Python program wouldn't read
> > > it right because Python's idea of encoding differs from my editor's.
> >
> > I don't know about vi, but notepad will open and save files that are not
> in
> > the system ("ANSI") encoding just fine. On opening it checks for a BOM
> and
> > auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you
> choose
> > "Unicode" (UTF-16LE), "Unicode big-endian" (UTF-16BE), or UTF-8 in the
> > Encoding drop-down box.
> >
> > This is exactly the behaviour that most users would expect of a
> well-behaved
> > Unicode-aware app. It should be as easy as possible to match this
> behaviour
> > in a Python program.
>
> And this is exactly why I want the determination of the default
> encoding (i.e. the encoding to be used when opening a file when no
> explicit encoding is specified by the Python code that does the
> opening) to be open-ended, rather than picking some standard default
> like UTF-8 and saying (like Paul seems to want to say) "this is it".


I never suggested that UTF-8 should be the default. In fact, I think it was
very wise of Python 2.x to make ASCII the default and I'm astounded to hear
that you regret that decision. "In the face of ambiguity, refuse the
temptation to guess."

Python 2.x provided an option to allow users to change the default
system-wide and ever since then we've (almost unanimously) counselled users
against changing it.

> > Sorry Paul, I appreciate your standards-driven perspective, but in
> > > this area I'd rather build in more flexibility than strictly needed,
> > > than too little. If it turns out that on a particular platform all
> > > files are in UTF-8, making Python *on that platform* always choose
> > > UTF-8 is simple enough.
> >
> > The problem is not the systems where all files are UTF-8, or all files
> are
> > another known charset. The problem is the platforms where half of the
> files
> > are UTF-8 and half are in some other charset, determined either by type
> or by
> > presence of a UTF-8 BOM. This is a *very* common situation, especially
> for
> > European users.
>
> Right. (And Paul appears to be ignorant of this.)


I don't see how the fact that an individual system can have half of the
files in one encoding and half in another could argue IN FAVOUR of a
system-global default. I would have thought it strengthens my argument
AGAINST trying to apply a random encoding to files.

You said:

"If on a particular box
most files are encoded in encoding X, and the user did whatever is
necessary to tell the tools that that's their preferred encoding, I
want Python to honor that encoding when opening text files, unless the
program makes other arrangements explicitly (such as specifying an
explicit encoding as a parameter to open())."

But there is no such thing that "most users do" to tell tools what their
preferred encoding is. Most users use some random (to them) operating
system default which on Windows is usually wrong and is different (for no
particular reason) on the Macintosh than on Linux. Long-time Windows users
in this thread cannot even agree on what the default is for US English
Windows, because there is no single default. There are two.

Can we at least agree that if LC_CHARSET is demonstrably wrong most of the
time on Windows, we should not use it (at least on Windows)?

 Paul Prescod

From paul at prescod.net  Wed Sep  6 04:00:06 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 5 Sep 2006 19:00:06 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FE1A65.7020900@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com>
	<ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>
	<44FE1A65.7020900@blueyonder.co.uk>
Message-ID: <1cb725390609051900ua1759feu998fd33aebb77d56@mail.gmail.com>

On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>
> Guido van Rossum wrote:
> > On 9/5/06, Brian Quinlan <brian at sweetapp.com> wrote:
> > [...]
> >
> > That would not be doing what the user wants. We have extensive
> > experience with defaulting to ASCII in Python 2.x and it's mostly bad.
> > There should definitely be a way to force ASCII as the default
> > encoding (if only as a debugging aid), both in the program code and in
> > the environment; but it shouldn't be the only default. There should
> > also be a way to force UTF-8 as the default, or ISO-8859-1. But if
> > CP436 is the default encoding set by the OS I don't see why Python
> > shouldn't use that as the default *in the absence of any other
> > preferences*.
>
> Cp436 is almost certainly *not* the encoding set by the OS; Python
> has got it wrong. If Brian is using an English-language variant of
> Windows XP and has not changed the defaults, the system ("ANSI")
> encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1
> if C1 control characters are not used).


http://www.ianywhere.com/developer/product_manuals/sqlanywhere/0902/en/html/dbdaen9/00000376.htm

"There are at least two code pages in use on most PCs. Applications using
the Windows graphical user interface use the Windows code pages. These code
pages are compatible with ISO character sets, and also with ANSI character
sets. They are often referred to as *ANSI code pages*.

Character-mode applications (those using the console or command prompt
window) in Windows 95/98/Me and Windows NT/2000/XP, use code pages that were
used in DOS. These are called *OEM code pages* (Original Equipment
Manufacturer) for historical reasons.

...
 Example

Consider the following situation:

   - A PC is running a Windows operating system with ANSI code page 1252.
   - The code page for character-mode applications is OEM code page 437.
   - Text is held in a database created using the collation UTF8.

An upper case A grave in the database is stored as hex bytes C3 80. In a
Windows application, the same character is represented as hex C0. In a DOS
application, it is represented as hex B7."
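
(The hex values are easy to check from Python:

ch = u'\u00c0'
print(repr(ch.encode('utf-8')))    # '\xc3\x80' -> hex C3 80
print(repr(ch.encode('cp1252')))   # '\xc0'     -> hex C0
print(repr(ch.encode('cp850')))    # '\xb7'     -> hex B7

Note that B7 is cp850's encoding of A-grave; cp437 itself has no
upper-case A grave, so the manual's last figure appears to reflect the
cp850 OEM page rather than cp437.)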

Now notice that when we introduce Unicode (and all Python 3K strings are
Unicode), we aren't talking about DISPLAY of characters. We're talking about
INTERPRETATION of characters. So if I read a file and then merge it with
some XML data, an application that uses the Windows default encoding will
create different output when the Python script is run from the command line
versus from the Windows desktop. Same app. Same data. Different default
encodings. Different output.

Of course we could arbitrarily choose one of these two encodings as the
"true" one, but the fact that they are ALMOST ALWAYS inconsistent indicates
something about how likely either one is to be correct for a particular
user's goals.

 Paul Prescod

From david.nospam.hopwood at blueyonder.co.uk  Wed Sep  6 04:52:28 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Wed, 06 Sep 2006 03:52:28 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609051900ua1759feu998fd33aebb77d56@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org>	
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	
	<44FC4B5B.9010508@blueyonder.co.uk>	
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>	
	<44FDBDFF.7090505@sweetapp.com>	
	<ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>	
	<44FE1A65.7020900@blueyonder.co.uk>
	<1cb725390609051900ua1759feu998fd33aebb77d56@mail.gmail.com>
Message-ID: <44FE37EC.2050504@blueyonder.co.uk>

Paul Prescod wrote:
> On 9/5/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>> Guido van Rossum wrote:
>> > On 9/5/06, Brian Quinlan <brian at sweetapp.com> wrote:
>> > [...]
>> >
>> > That would not be doing what the user wants. We have extensive
>> > experience with defaulting to ASCII in Python 2.x and it's mostly bad.
>> > There should definitely be a way to force ASCII as the default
>> > encoding (if only as a debugging aid), both in the program code and in
>> > the environment; but it shouldn't be the only default. There should
>> > also be a way to force UTF-8 as the default, or ISO-8859-1. But if
>> > CP436 is the default encoding set by the OS I don't see why Python
>> > shouldn't use that as the default *in the absence of any other
>> > preferences*.
>>
>> Cp436 is almost certainly *not* the encoding set by the OS; Python
>> has got it wrong. If Brian is using an English-language variant of
>> Windows XP and has not changed the defaults, the system ("ANSI")
>> encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1
>> if C1 control characters are not used).
> 
> http://www.ianywhere.com/developer/product_manuals/sqlanywhere/0902/en/html/dbdaen9/00000376.htm
> 
> "There are at least two code pages in use on most PCs. Applications using
> the Windows graphical user interface use the Windows code pages. These code
> pages are compatible with ISO character sets, and also with ANSI character
> sets. They are often referred to as *ANSI code pages*.
> 
> Character-mode applications (those using the console or command prompt
> window) in Windows 95/98/Me and Windows NT/200/XP, use code pages that were
> used in DOS. These are called *OEM code pages* (Original Equipment
> Manufacturer) for historical reasons.

True, I oversimplified.

In practice, each text file on a Windows system is somewhat more likely to be
encoded in the ANSI charset than in the OEM charset (unless the user still
commonly uses DOS-era applications). The OEM charset only exists at all as a
compatibility hack.

> Of course we could arbitrarily choose one of these two encodings as the
> "true" one, but the fact that they are ALMOST ALWAYS inconsistent indicates
> something about how likely either one is to be correct for a particular
> user's goals.

Right -- it's impossible to make a clear distinction between "files used by
console applications" and "files used by graphical applications", since any
text file can be used by both. This just supports my assertion that there
should not be a "default" encoding at all.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From qrczak at knm.org.pl  Wed Sep  6 08:10:44 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 06 Sep 2006 08:10:44 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FE243F.80203@blueyonder.co.uk> (David Hopwood's message of
	"Wed, 06 Sep 2006 02:28:31 +0100")
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk>
Message-ID: <87bqpttti3.fsf@qrnik.zagroda>

David Hopwood <david.nospam.hopwood at blueyonder.co.uk> writes:

> The whole idea of a default encoding is flawed. Ideally there would be
> no default; programmers should be forced to think about the issue
> on a case-by-case basis. In some cases they might choose to open a file
> with the system encoding, but that should be an explicit decision.

Perhaps this shows a difference between Unix and Windows culture.

On Unix there is definitely a default encoding; this is what most good
programs operating on text files assume by default. It would be insane
to have to tell each program separately about the encoding. Locale is
the OS mechanism used to provide this information in a uniform way.
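
For instance (Unix-only; locale.nl_langinfo is not available on Windows):

import locale
locale.setlocale(locale.LC_ALL, '')          # adopt the user's locale
print(locale.nl_langinfo(locale.CODESET))    # e.g. 'ISO-8859-2' or 'UTF-8'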

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Wed Sep  6 12:08:21 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 03:08:21 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <87bqpttti3.fsf@qrnik.zagroda>
References: <ed8pd9$ch$1@sea.gmane.org>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
Message-ID: <1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>

On 9/5/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> David Hopwood <david.nospam.hopwood at blueyonder.co.uk> writes:
>
> > The whole idea of a default encoding is flawed. Ideally there would be
> > no default; programmers should be forced to think about the issue
> > on a case-by-case basis. In some cases they might choose to open a file
> > with the system encoding, but that should be an explicit decision.
>
> Perhaps this shows a difference between Unix and Windows culture.
>
> On Unix there is definitely a default encoding; this is what most good
> programs operating on text files assume by default. It would be insane
> to have to tell each program separately about the encoding. Locale is
> the OS mechanism used to provide this information in a uniform way.

Windows users do not "tell each program separately about the
encoding." The encoding varies by file type. It makes no more sense to
have a global variable that says "all of my files are Shift-JIS" than
it does to say "all of my files are PowerPoint files." Because someday
somebody is going to email you a Big-5 file (or a zipfile) and that
setting will be wrong. Once you know that a file is of type Zip then
you know that the "encoding" is zipped binary. Once you know that it
is an Office 2007 file, then you know that the encoding is Zipped XML
and that the XML will have its own encoding declaration. Once you know
that it is HTML, then you look for meta tags.

This is how real-world programs work. They shouldn't guess based on
system global variables.

May I ask an empirical question? In your experience, what percentage of
Macintosh users change the default encoding from US-ASCII to something
specific to their culture? What percentage of Ubuntu users change it
from UTF-8 to something specific?

If the answers are "few", then we are talking about a feature that
will break Windows programs and offer little value to Unix and
Macintosh users.

If "many" users change the global system encoding on their modern Unix
distributions then I propose the following. There should be a property
called something like "encodings.recommendedEncoding". On Windows it
should be ASCII. On Unix-like platforms it can be inferred from the
locale. Programmers who know what it means and want to take advantage
of it can do so like this:

opentext(filename, "r", encoding=encodings.recommendedEncoding)

This is almost exactly how C# does it, though it uses the confusing
term "default encoding", which implies a default behaviour.

If no encoding argument is given, the default should be ASCII or perhaps
UTF-8 (either one is relatively safe against silently processing data
incorrectly).
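
A hypothetical sketch of such a helper (recommended_encoding and the
win32 special case merely illustrate the proposal; they are not an
existing API):

import locale, sys, codecs

def recommended_encoding():
    # ASCII on Windows, the locale's charset elsewhere, per the
    # proposal above.
    if sys.platform == 'win32':
        return 'ascii'
    return locale.getpreferredencoding() or 'ascii'

f = codecs.open('notes.txt', 'r', encoding=recommended_encoding())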

 Paul Prescod

From phd at mail2.phd.pp.ru  Wed Sep  6 12:37:51 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Wed, 6 Sep 2006 14:37:51 +0400
Subject: [Python-3000] encoding hell
In-Reply-To: <ca471dc20609051809iae9cddcxd13b3a7c35c6c4ae@mail.gmail.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>
	<44FB0CBF.7070102@jmunch.dk>
	<1d85506f0609031117h4074b7b4t2402ce665cb147f1@mail.gmail.com>
	<87lkp0bsxw.fsf@qrnik.zagroda> <20060903204528.GA3950@panix.com>
	<20060904102413.GC21049@phd.pp.ru>
	<ca471dc20609051809iae9cddcxd13b3a7c35c6c4ae@mail.gmail.com>
Message-ID: <20060906103751.GD30635@phd.pp.ru>

On Tue, Sep 05, 2006 at 06:09:21PM -0700, Guido van Rossum wrote:
> On 9/4/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
> >--- email database file ---
> > phd at phd.pp.ru
> > phd at oper.med.ru
> >--- / ---
> >
> >   The program opens the file in "r+" mode, reads it line by line and
> >stores the position of the first character of every line using tell().
> >When it needs to mark an email it seek()'s to the stored position and
> >writes a '+' mark so the file looks like
> >
> >--- email database file ---
> >+phd at phd.pp.ru
> > phd at oper.med.ru
> >--- / ---
> 
> I don't understand how it can insert a character into the file without
> rewriting everything after that point.

   The essential part of the program is:

from email.Utils import getaddresses  # 'email.utils' from Python 2.5 on

results = open("results", "r+")
name, email = getaddresses([to])[0]   # 'to' holds the To: header value

while 1:
    pos = results.tell()
    line = results.readline()
    if not line:
        break

    if line.strip() == email:
        results.seek(pos)
        results.write('+')
        break

results.close()

   Open the "database" file in "r+" mode, find the email, seek to the
beginning of the line, replace the space with '+'.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From phd at mail2.phd.pp.ru  Wed Sep  6 12:48:39 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Wed, 6 Sep 2006 14:48:39 +0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
References: <ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
Message-ID: <20060906104839.GE30635@phd.pp.ru>

On Wed, Sep 06, 2006 at 03:08:21AM -0700, Paul Prescod wrote:
> Windows users do not "tell each program separately about the
> encoding." The encoding varies by file type. It makes no more sense to
> have a global variable that says "all of my files are Shift-JIS" than
> it does to say "all of my files are PowerPoint files." Because someday
> somebody is going to email you a Big-5 file (or a zipfile) and that
> setting will be wrong. Once you know that a file is of type Zip then
> you know that the "encoding" is zipped binary. Once you know that it
> is an Office 2007 file, then you know that the encoding is Zipped XML
> and that the XML will have its own encoding declaration. Once you know
> that it is HTML, then you look for meta tags.
> 
> This is how real-world programs work. They shouldn't guess based on
> system global variables.

   Unfortunately, the real world is a bit worse than that. There are many
protocols and file formats that carry textual information and still don't
provide a hint about the encoding.
   First, there are text files. Really, there are still text files. A user
can dump a README file onto his/her personal FTP server, and the file is
usually in the local encoding.
   MP3 tags. Real nightmare. Nobody follows the standard - tag editors
write tags in the local encoding, and mp3 players interpret them in the
local encoding.
   FTP and other dumb protocols that transfer file names in the encoding
local to the server without announcing that encoding in the metadata.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From qrczak at knm.org.pl  Wed Sep  6 12:51:55 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 06 Sep 2006 12:51:55 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com> (Paul
	Prescod's message of "Wed, 6 Sep 2006 03:08:21 -0700")
References: <ed8pd9$ch$1@sea.gmane.org>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
Message-ID: <87k64hl12s.fsf@qrnik.zagroda>

"Paul Prescod" <paul at prescod.net> writes:

> Windows users do not "tell each program separately about the
> encoding." The encoding varies by file type.

There are lots of Unix file types which are based on text files
and their encoding is not specified explicitly.

> It makes no more sense to have a global variable that says "all of
> my files are Shift-JIS" than it does to say "all of my files are
> PowerPoint files."

Not all: it's just the default for text files.

> This is how real-world programs work. They shouldn't guess based on
> system global variables.

But they do. It's a fact which is impossible to change with a decree.
There is no place, other than the locale, which would suggest which
encoding is used in /etc files, or in the contents of environment
variables, or on the terminal. You might say that it's unfortunate,
but it's true.

At most you could advocate specifying new file formats with the
encoding in mind, like XML does. This doesn't enrich existing file
formats with that information.

Of course technically these formats are just sequences of bytes,
and most programs pass non-ASCII fragments around without looking
into them deeper. But as long as one tries to treat them as natural
language text, search them case-insensitively, embed text taken from
them in HTML files, then the encoding begins to matter, and there is
a general shift among programming languages to translate it on I/O
to a common format instead of dealing with encoded text on all levels.

> May I ask an empircal question? In your experience, what percentage
> of Macintosh users change the default encoding from US-ASCII to
> something specific to their culture?

I have no experience with Macintoshes at all.

> What percentage of Ubuntu users change it froom UTF-8 to something
> specific?

Why would it matter? If most of their programs use UTF-8, and it's
specified by the locale, then fine. My system uses mostly ISO-8859-2,
and it's also fine, as long as there is a way for the program to get
that information.

If a program can't read my text files or filenames or environment
variables or program invocation arguments, while they are encoded
according to the locale, then the program is broken.

If a file is not encoded using the encoding specified by the locale,
and I don't tell the program explicitly about the encoding, then it's
not the program's fault when it can't read that.

If a language requires extra steps in order to make the locale
encoding work, then it's unhelpful. Most programmers won't bother,
and their programs will work most of the time when they test it,
assuming they use it with English texts. Such programs suddenly break
when used in a non-English speaking country.

> If the answers are "few", then we are talking about a feature that
> will break Windows programs and offer little value to Unix and
> Macintosh users.

How does it break more programs than assuming ASCII does? All
encodings suitable as a system encoding are ASCII supersets, so if
a file can't be read using the locale encoding, it can't be read
in ASCII either.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Wed Sep  6 12:55:04 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 03:55:04 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <20060906104839.GE30635@phd.pp.ru>
References: <ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
Message-ID: <1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>

But how would a system-wide default encoding help with any of these
situations? These situations are IN FACT caused by system-wide default
encodings used by naive programmers. Python should be part of the
solution, not part of the problem.

On 9/6/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
> On Wed, Sep 06, 2006 at 03:08:21AM -0700, Paul Prescod wrote:
> ...
>    Unfortunately, the real world is a bit worse than that. There are many
> protocols and file formats that carry textual information and still don't
> provide a hint about the encoding.
>    First, there are text files. Really, there are still text files. A user
> can dump a README file onto his/her personal FTP server, and the file is
> usually in the local encoding.
>    MP3 tags. Real nightmare. Nobody follows the standard - tag editors
> write tags in the local encoding, and mp3 players interpret them in the
> local encoding.
>    FTP and other dumb protocols that transfer file names in the encoding
> local to the server without announcing that encoding in the metadata.
>
> Oleg.
> --
>      Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
>            Programmers don't die, they just GOSUB without RETURN.

From phd at mail2.phd.pp.ru  Wed Sep  6 13:16:43 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Wed, 6 Sep 2006 15:16:43 +0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
References: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
	<1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
Message-ID: <20060906111643.GA4412@phd.pp.ru>

On Wed, Sep 06, 2006 at 03:55:04AM -0700, Paul Prescod wrote:
> But how would a system-wide default encoding help with any of these
> situations? These situations are IN FACT caused by system-wide default
> encodings used by naive programmers. Python should be part of the
> solution, not part of the problem.
> 
> On 9/6/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
> >    First, there are text files. Really, there are still text files. A user
> > can dump a README file onto his/her personal FTP server, and the file is
> > usually in the local encoding.
> >    MP3 tags. Real nightmare. Nobody follows the standard - tag editors
> > write tags in the local encoding, and mp3 players interpret them in the
> > local encoding.
> >    FTP and other dumb protocols that transfer file names in the encoding
> > local to the server without announcing that encoding in the metadata.

   These situations are caused by the lack of metadata or clear
encoding-friendly standards. Ogg, for example, is encoding-friendly - it
clearly states that tags (comments) must be in UTF-8, and all Ogg Vorbis
files I have seen really were in UTF-8, and all tag editors and players
write/use UTF-8. XML is encoding-friendly - every file specifies its
encoding. The HTTP protocol is mostly encoding-friendly with its
Content-Type header. HTML is partially encoding-friendly, but only
partially - if one saves an HTML page to a file it may lack encoding
information.
   But text files and the FTP protocol don't have any metadata, and ID3v2
doesn't specify a universal encoding or encoding metadata. In these cases
programs can either guess the encoding based on the file content or use
the system global encoding.
   I fail to see how Python can help here.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From brian at sweetapp.com  Wed Sep  6 13:33:43 2006
From: brian at sweetapp.com (Brian Quinlan)
Date: Wed, 06 Sep 2006 13:33:43 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <87k64hl12s.fsf@qrnik.zagroda>
References: <ed8pd9$ch$1@sea.gmane.org>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>	<44FE171C.1090101@blueyonder.co.uk>	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>	<44FE243F.80203@blueyonder.co.uk>
	<87bqpttti3.fsf@qrnik.zagroda>	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<87k64hl12s.fsf@qrnik.zagroda>
Message-ID: <44FEB217.6040507@sweetapp.com>

Marcin 'Qrczak' Kowalczyk wrote:
> Why would it matter? If most of their programs use UTF-8, and it's
> specified by the locale, then fine. My system uses mostly ISO-8859-2,
> and it's also fine, as long as there is a way for the program to get
> that information.

The problem is that blindly using the system encoding is error prone.

For example, I would imagine that when you type:

% less /usr/lib/python2.4/getopt.py

you see "Peter ?strand" rather than "Peter ?strand".

That happens because getopt.py is encoded in ISO-8859-1 and you are 
using ISO-8859-2 as your default encoding. Maybe you don't care about 
the display glitch but there are applications where it would be a big 
deal, e.g. populating a database based on the content of text files.
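
The mix-up is easy to reproduce (a self-contained sketch of the
Latin-1/Latin-2 confusion):

name = u'Peter \xc5strand'              # A-ring, as the author wrote it
data = name.encode('iso-8859-1')        # bytes as stored in getopt.py
print(repr(data.decode('iso-8859-2')))  # u'Peter \u0139strand' (L-acute)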

> If a program can't read my text files or filenames or environment
> variables or program invocation arguments, while they are encoded
> according to the locale, then the program is broken.

How can the program know if the file is encoded according to your 
locale? Do you think that all of the text files on your system are 
encoded using ISO-8859-2? Should Python really just guess for you?

> If a file is not encoded using the encoding specified by the locale,
> and I don't tell the program explicitly about the encoding, then it's
> not the program's fault when it can't read that.
> 
> If a language requires extra steps in order to make the locale
> encoding work, then it's unhelpful.

No, it's favoring caution and trying to avoid letting errors slip 
through. If programmers believe they understand the issues and want to
use the locale encoding setting, it will cost them <20 characters of
typing per file open to do so.

> Most programmers won't bother,
> and their programs will work most of the time when they test it,
> assuming they use it with English texts. Such programs suddenly break
> when used in a non-English speaking country.

And that is a great thing! Their program will break in a nice clean 
understandable way, instead of proceeding and generating incorrect results.

Cheers,
Brian

From murman at gmail.com  Wed Sep  6 15:18:19 2006
From: murman at gmail.com (Michael Urman)
Date: Wed, 6 Sep 2006 08:18:19 -0500
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <20060906111643.GA4412@phd.pp.ru>
References: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
	<1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
	<20060906111643.GA4412@phd.pp.ru>
Message-ID: <dcbbbb410609060618h10df3cf5r6369a4e885b8608c@mail.gmail.com>

On 9/6/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
>    These situations are caused by the lack of metadata or clear
> encoding-friendly standards. Ogg, for example, is encoding-friendly - it
> clearly states that tags (comments) must be in UTF-8, and all Ogg Vorbis
> files I have seen really were in UTF-8, and all tag editors and players
> write/use UTF-8.

And yet I've run across Vorbis comments encoded in Latin-1. It screws
everyone else up, but there are always going to be applications that
do not play along.

> XML is encoding-friendly - every file specifies its encoding.

And plenty of people use methods to read and write it which cannot
cope with non ascii files.

> The HTTP protocol is mostly encoding-friendly with its Content-Type
> header. HTML is partially encoding-friendly, but only partially - if one
> saves an HTML page to a file it may lack encoding information.

Right; HTTP has the means to indicate the encoding, but rarely does it
have the means to acquire it.

>    But text files and the FTP protocol don't have any metadata, and ID3v2
> doesn't specify a universal encoding or encoding metadata. In these cases
> programs can either guess the encoding based on the file content or use
> the system global encoding.

Actually, ID3v2 offers exactly four encodings: latin1, UTF16,
UTF16-BE, and UTF8. However, UTF16 isn't endian-determined, and latin1
has been abused and holds Windows-ACP-encoded text more often than
not, so it's a poor indicator. Another case of applications ignoring
the spec and doing what's easy. (I don't recall exactly when the
Unicode encoding options were added, so they may have had little
choice; more likely they were too lazy to use UTF16, or it wouldn't
work on their portable device.)
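
For reference, a sketch of the ID3v2.4 text-encoding byte mapped to Python
codec names (v2.3 defines only the first two):

ID3_TEXT_ENCODINGS = {0: 'latin-1',    # in practice often the Windows ACP
                      1: 'utf-16',     # with BOM
                      2: 'utf-16-be',  # v2.4 only
                      3: 'utf-8'}      # v2.4 only

def decode_id3_text(enc_byte, payload):
    return payload.decode(ID3_TEXT_ENCODINGS[enc_byte])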

>    I fail to see how Python can help here.

Absolutely agreed. I suspect the best option is some sort of TextFile
constructor that defaults to ASCII (or has no default) but accepts an
easy way to use the "recommended" or system encoding, or any explicit
one. And for more complicated formats, the code will just have to use
a bytestream layer and decode as necessary. This may be a pain for
mbox files, but unless there's a way to switch encodings on the fly, a
seemingly textual file will have to be treated as binary (newlines
excepted, I hope).
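
Such a TextFile constructor might look like this (purely a sketch; the
TextFile name and the "system" marker are illustrative, not a real API):

f = TextFile("notes.txt")                     # strict default (ASCII), or an error
f = TextFile("notes.txt", encoding="system")  # explicit opt-in to the locale encoding
f = TextFile("notes.txt", encoding="koi8-r")  # any explicit encoding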

I also hope that, if the "recommended" encoding uses a heuristic on
the file's contents, the file has enough data in the encoding to make
a good guess. Music metadata rarely does. :)

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

From david.nospam.hopwood at blueyonder.co.uk  Tue Sep  5 02:28:54 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Tue, 05 Sep 2006 01:28:54 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>	
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
Message-ID: <44FCC4C6.9030500@blueyonder.co.uk>

Guido van Rossum wrote:
> On 9/4/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>> Guido van Rossum wrote:
>>
>> > I've always said (can someone find a quote perhaps?) that there ought
>> > to be a sensible default encoding for files (including but not limited
>> > to stdin/out/err), perhaps influenced by personalized settings,
>> > environment variables, the OS, etc.
>>
>> While it should be possible to find out what the OS believes to be
>> the current "system" charset (GetCPInfoEx(CP_ACP, ...) on Windows;
>> LC_CHARSET environment variable on Unix), that does not mean that it
>> is this charset that Python programs should normally use. When defining
>> a new text-based file type, it is simpler to define it to be always
>> UTF-8.
> 
> In this particular case I don't care what's simpler to implement,

The issue is not simplicity of implementation; it is what will provide
the simplest usage model in the long term. If new files are encoded in X
just because most of a user's existing files are encoded in X, then how is
the user supposed to migrate to a different encoding? Language specifications
can have a significant effect in helping migration to Unicode.

> but what's most likely to do what the user expects.

In practice, the system charset is often set to the charset that should
be used as a fallback *for applications that do not support Unicode*. This
is especially true on Windows systems.

Using UTF-8 by default for new file types is not only simpler, it's more
functional. If a BOM is written at the start of the file, and if the user
edits files with a text editor that recognizes this, then everything,
including writing text in multiple scripts, will Just Work from the user's
point of view.
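
Concretely (Python 2; codecs.BOM_UTF8 is the three-byte UTF-8 signature):

import codecs

f = open("notes.txt", "wb")
f.write(codecs.BOM_UTF8)                        # UTF-8 signature
f.write(u"text in any script".encode("utf-8"))
f.close()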

> If on a particular box
> most files are encoded in encoding X, and the user did whatever is
> necessary to tell the tools that that's their preferred encoding, I
> want Python to honor that encoding when opening text files, unless the
> program makes other arrangements explicitly (such as specifying an
> explicit encoding as a parameter to open()).

I would prefer that there is no default. But since that is incompatible
with the existing API for open(), I accept that I'm not likely to win
that argument.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From jimjjewett at gmail.com  Wed Sep  6 18:50:24 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 6 Sep 2006 12:50:24 -0400
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FCC4C6.9030500@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<44FCC4C6.9030500@blueyonder.co.uk>
Message-ID: <fb6fbf560609060950r3e0634cfsd32385fb902c6bb3@mail.gmail.com>

On 9/4/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:

> The issue is not simplicity of implementation; it is what will provide
> the simplest usage model in the long term. If new files are encoded in X
> just because most of a user's existing files are encoded in X, then how is
> the user supposed to migrate to a different encoding? ...

> In practice, the system charset is often set to the charset that should
> be used as a fallback *for applications that do not support Unicode*.

Are you assuming that most uses of open will be for new files, *and*
that these files will not also be read by such unicode-ignorant
applications?

Since we're only talking about text files that do not have an explicit
encoding, I can barely imagine *either* of these conditions being
true.

-jJ

From paul at prescod.net  Wed Sep  6 19:15:44 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 10:15:44 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <87k64hl12s.fsf@qrnik.zagroda>
References: <ed8pd9$ch$1@sea.gmane.org>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<87k64hl12s.fsf@qrnik.zagroda>
Message-ID: <1cb725390609061015h1c953b7l765e42cacdff2a71@mail.gmail.com>

On 9/6/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> "Paul Prescod" <paul at prescod.net> writes:
>
> > Windows users do not "tell each program separately about the
> > encoding." The encoding varies by file type.
>
> There are lots of Unix file types which are based on text files
> and their encoding is not specified explicitly.

Of course. But you asserted that the Windows world was insane and I
made the point that it is not. They've just consciously and explicitly
moved away from the situation where the encoding is inferred from the
environment instead of from the file's context. I'm not starting a
Windows versus Unix debate. I'm talking about the direction in which
the world is moving.

Python need not move forward in that direction, but it should not move
backwards. Today, Python does not use the locale in inferring a file's
type. Python also explicitly chose not to use the locale in inferring
string encodings when Unicode was added.

I'm not saying that Python programmers should be disallowed from using
the system locale. I'm saying that Python itself should "resist the
urge to guess" encodings. Python programmers who want to guess could
have an easy, one-line way, as C# programmers do.

> But they do. It's a fact which is impossible to change with a
> decree.

I'm not trying to change tools. I'm asking that Python not emulate
their broken behaviour. If a Python programmer wants to do so, then
they should add one line of code.

> > What percentage of Ubuntu users change it from UTF-8 to something
> > specific?
>
> Why would it matter?

I said explicitly why it matters in my first message. If most Unix
users just accept system defaults, then the feature is of no value to
them, while it actively hurts Windows programmers. So you have
decreasing value on one side and a steady amount of pain on the other.

> If a program can't read my text files or filenames or environment
> variables or program invocation arguments, while they are encoded
> according to the locale, then the program is broken.

Either you are saying that Python is broken today, or you are saying
that Python should allow people to write programs that are "not
broken" according to your definition. In the former case, I disagree.
In the latter case, I agree. The only thing we could disagree on is
whether Python's default behaviour should be to guess the encodings
based upon locale, despite Python's long history of avoiding guessing
in general and guessing encodings in particular.

>...
> If a language requires extra steps in order to make the locale
> encoding work, then it's unhelpful. Most programmers won't bother,
> and their programs will work most of the time when they test it,
> assuming they use it with English texts. Such programs suddenly break
> when used in a non-English speaking country.

Loudly and suddenly breaking is better than silently munging data.
There are vast application classes where using the system encoding is
the wrong thing. For example, an FTP server. An application working
with data from a remote socket. An application working with a file
from a remote server. An application working with incoming email.
Python cannot know whether you are building a client/server
application or a script for working with local files. It can't even
really know whether a file that it opens is truly local. So it
shouldn't guess.

> > If the answers are "few", then we are talking about a feature that
> > will break Windows programs and offer little value to Unix and
> > Macintosh users.
>
> How does it break more programs than assuming ASCII does? All
> encodings suitable as a system encoding are ASCII supersets, so if
> a file can't be read using the locale encoding, it can't be read
> in ASCII either.

If a program expecting ASCII sees an unknown character then it can
throw an exception and say: "You haven't thought through the
internationalization aspects properly. Read the Python docs for more
information." Silently munging data is worse. "In the face of
ambiguity, refuse the temptation to guess."
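
For example, strict ASCII decoding already fails loudly today (Python 2):

>>> "r\xc3\xa9sum\xc3\xa9".decode("ascii")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)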

 Paul Prescod

From paul at prescod.net  Wed Sep  6 19:21:33 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 10:21:33 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <20060906111643.GA4412@phd.pp.ru>
References: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
	<1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
	<20060906111643.GA4412@phd.pp.ru>
Message-ID: <1cb725390609061021y11e37727kb6c94668392a36f7@mail.gmail.com>

On 9/6/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
> On Wed, Sep 06, 2006 at 03:55:04AM -0700, Paul Prescod wrote:
>    These situations are caused by the lack of metadata or clear
> encoding-friendly standards. Ogg, for example, is encoding-friendly - it
> clearly states that tags (comments) must be in UTF-8, and all Ogg Vorbis
> files I have seen were really in UTF-8, and all tag editors and players
> write/use UTF-8.

Michael Urman disagrees with you. He says that he sometimes sees
Latin-1 encoded files. Let's trace back how that could have happened.

1. The end-user must have had Latin-1 as their system encoding.

2. The programmer of the ID tagging app had not thought through encoding issues.

3. The programming language either implicitly encoded the data
according to the locale or treated it as binary data. (unless the
programmer did this on purpose, which would imply that he was VERY
confused and not just lazy)

>    I fail to see how Python can help here.

Python can refuse to be the programming language in Step 3 that
guesses the appropriate encoding without consulting the programmer or
end-user.

 Paul Prescod

From paul at prescod.net  Wed Sep  6 19:23:37 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 10:23:37 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <dcbbbb410609060618h10df3cf5r6369a4e885b8608c@mail.gmail.com>
References: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<44FE171C.1090101@blueyonder.co.uk>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
	<1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
	<20060906111643.GA4412@phd.pp.ru>
	<dcbbbb410609060618h10df3cf5r6369a4e885b8608c@mail.gmail.com>
Message-ID: <1cb725390609061023g6562f11ah7247ef356149a681@mail.gmail.com>

On 9/6/06, Michael Urman <murman at gmail.com> wrote:
> ... I suspect the best option is some sort of TextFile
> constructor that defaults to ASCII (or has no default) but accepts an
> easy way to use the "recommended" or system encoding, or any explicit
> one.

That's exactly what I'm asking for.

 Paul Prescod

From paul at prescod.net  Wed Sep  6 19:28:12 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 6 Sep 2006 10:28:12 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FCC4C6.9030500@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<44FCC4C6.9030500@blueyonder.co.uk>
Message-ID: <1cb725390609061028g285565dasd03dd58e80602dd9@mail.gmail.com>

On 9/4/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>...
> I would prefer that there is no default. But since that is incompatible
> with the existing API for open(), I accept that I'm not likely to win
> that argument.

First, can you outline how the proposal of no default is incompatible
with the existing API for open?

print open("Documents/foo.py").encoding

Second: the whole IO library is being overhauled. How can backwards
compatibility be an issue?

 Paul Prescod

From murman at gmail.com  Wed Sep  6 20:28:09 2006
From: murman at gmail.com (Michael Urman)
Date: Wed, 6 Sep 2006 13:28:09 -0500
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609061023g6562f11ah7247ef356149a681@mail.gmail.com>
References: <ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<ca471dc20609051744u682b6e5xe6d1006337ebba4a@mail.gmail.com>
	<44FE243F.80203@blueyonder.co.uk> <87bqpttti3.fsf@qrnik.zagroda>
	<1cb725390609060308pfd0924aia44d0338b250f646@mail.gmail.com>
	<20060906104839.GE30635@phd.pp.ru>
	<1cb725390609060355u68c66a72s6d7656a8079ade7b@mail.gmail.com>
	<20060906111643.GA4412@phd.pp.ru>
	<dcbbbb410609060618h10df3cf5r6369a4e885b8608c@mail.gmail.com>
	<1cb725390609061023g6562f11ah7247ef356149a681@mail.gmail.com>
Message-ID: <dcbbbb410609061128v233e98e5y3d2fd13323e80c76@mail.gmail.com>

On 9/6/06, Paul Prescod <paul at prescod.net> wrote:
> On 9/6/06, Michael Urman <murman at gmail.com> wrote:
> > ... I suspect the best option is some sort of TextFile
> > constructor that defaults to ASCII (or has no default) but accepts an
> > easy way to use the "recommended" or system encoding, or any explicit
> > one.
>
> That's exactly what I'm asking for.

I suspect the difference in attitudes between us and those who don't
want explicit encodings is that we've dealt with the mess of
extracting information from various sources that use arbitrary
encodings, either indicated incorrectly or not at all, and we want
Python to help break that cycle. Those who want the ease of a TextFile
constructor which magically supplies the "recommended" (local?)
encoding might only deal with data in their local encoding, and aren't
aware that code like theirs provides the problem case for those who
deal with more. Not because a text file in the local encoding is a
problem, but because if they're not thinking of encoding there, they
won't think of it where it matters.

I have to learn more about the Japanese distaste for the Unicode
system, but I don't see how that could influence me into accepting,
e.g.,  ms932 as a silently-requested encoding. Do you have any clue if
or where that fits in?

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

From david.nospam.hopwood at blueyonder.co.uk  Thu Sep  7 02:46:11 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Thu, 07 Sep 2006 01:46:11 +0100
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <fb6fbf560609060950r3e0634cfsd32385fb902c6bb3@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>	
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	
	<44FC4B5B.9010508@blueyonder.co.uk>	
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	
	<44FCC4C6.9030500@blueyonder.co.uk>
	<fb6fbf560609060950r3e0634cfsd32385fb902c6bb3@mail.gmail.com>
Message-ID: <44FF6BD3.6060409@blueyonder.co.uk>

Jim Jewett wrote:
> On 9/4/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> 
>> The issue is not simplicity of implementation; it is what will provide
>> the simplest usage model in the long term. If new files are encoded in X
>> just because most of a user's existing files are encoded in X, then
>> how is the user supposed to migrate to a different encoding? ...
> 
>> In practice, the system charset is often set to the charset that should
>> be used as a fallback *for applications that do not support Unicode*.
> 
> Are you assuming that most uses of open will be for new files,

No, I'm refusing to make the assumption that all uses will be for old
files.

My position is that there should be no default encoding (not ASCII either,
although I may differ with Paul Prescod on that point). Note that Py3K is
the only opportunity to remove the idea of a default encoding -- Python
2.5 by default opens text files as US-ASCII, so this would be an incompatible
API change.

If a programmer explicitly chooses to open files with the system encoding
(by adding an "encoding=sys.get_file_content_encoding()" argument to a
file open call), that's absolutely fine. In that case they must have
considered encoding issues for at least a few seconds. That is the best
we can do.

APIs that open files should also be designed to allow auto-detection of
the encoding based on content. This requires that the detected encoding
be returned from the file open call, so that if the file needs to be
rewritten, that can be done in the same encoding that was detected (which
is the behaviour least likely to break existing applications that may read
the same file).
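
A sketch of that round trip (the "detect" marker and the .encoding
attribute are illustrative only, not a settled API):

f = open("config.txt", encoding="detect")
text = f.read()
detected = f.encoding            # the open call reports what it found
f.close()

# Rewrite in the encoding that was detected, so existing readers of
# the file keep working.
out = open("config.txt", "w", encoding=detected)
out.write(text)
out.close()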

> *and* that these files will not also be read by such unicode-ignorant
> applications?

I'm not making that assumption either.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From tomerfiliba at gmail.com  Thu Sep  7 19:30:45 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Thu, 7 Sep 2006 19:30:45 +0200
Subject: [Python-3000] iostack, second revision
Message-ID: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>

[Guido]
> As long as the state of the decoder is "neutral" at the start of a
> line, it should be possible to do this. I like the idea that tell()
> returns a "cookie" which is really a byte offset. If one wants to be
> able to seek to positions with a non-neutral decoder state, the cookie
> would have to be more abstract. It shouldn't matter; text apps should
> not do arithmetic on seek/tell positions.

[Andres]
> In all my programming days I don't believe I've written to and read from
> the same file handle even once.  Use cases exist, like if you're
> implementing a DBMS, or adding to a zip file in-place, but they're the
> exception, and by separating that functionality out in a dedicated
> class like FileBytes, you avoid having the complexities of mixed input
> and output affect your typical use cases.
[...]
> Watch out!  There's an essential difference between files and
> bidirectional communications channels that you need to take into
> account.  For a TCP connection, input and output can be seen as
> isolated from one another, with each their own stream position, and
> each their own contents.  For read/write files, it's a whole different
> ballgame, because stream position and data are shared.

[Talin]
> Now, I'm not saying that you can't stick additional layers in-between
> TextReader and FileStream if you want to. An example might be the
> "resync" layer that you mentioned, or a journaling layer that ensures
> that all writes are recoverable. I'm merely saying that for the specific
> issue of buffering, I think that the choice of buffer type is
> complicated, and requires knowledge that might not be accessible to the
> person assembling the stack.

---

lots of things have been discussed, lots of new ideas came:
it's time to rethink the design of iostack; i'll try to see into it.

there are several key issues:
* splitting streams to separate reading and writing sides.
* the underlying OS resource can be separated into some very low
level abstraction layer, over which streams would operate.
* stateful-seek-cookies sound like the perfect solution

issues with seeking:
being opaque, there's no sense in having the long-debated
position property (although i really liked it :)). i.e., there's no sense
in doing s.position += some_opaque_cookie

on the other hand, since streams are byte-oriented, over which the
data abstraction layer (text, etc.) is placed, maybe there's sense in
splitting these into two distinct APIs:

* tell()/seek() for the byte-level stream position: a stream is just a
sequence of bytes in which you can seek.
* data-abstraction-layer "pointers": pointers will be stateful stream
locations of encoded *objects*.

you will not be able to "forge" pointers: you'll first have to come across
a valid object location, and only then can you get a "pointer" pointing to it.
of course these pointers should be kept cheap, and for most situations,
plain integers would suffice.

example:

f = TextAdapter(BufferingLayer(FileStream(...)), encoding = "utf-32")
f.write("hello world")
p = f.get_pointer()
f.write("wide web")
f.set_pointer(p)

or using a property:
p = f.pointer
f.pointer = p

something like that... though i would like to receive comments on
that first, before i go into deeper meditation :)


-tomer

From paul at prescod.net  Thu Sep  7 21:21:12 2006
From: paul at prescod.net (Paul Prescod)
Date: Thu, 7 Sep 2006 12:21:12 -0700
Subject: [Python-3000] Help on text editors
Message-ID: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>

Guido has asked me to do some research in aid of a file encoding
detection/defaulting PEP.

I only have access to a small number of operating systems and language
variants so I need help.

If you have access to "German Windows XP", "Japanese Windows XP", "Spanish
OS X",  "Japanese OS X", "German Ubuntu" etc., I would appreciate answers to
the following questions.

1. On US English Windows, Notepad defaults to an encoding called "ANSI".
"ANSI" is not a real encoding at all (and certainly not one from the
American National Standards Institute -- they should sue!). ANSI is just
the default Windows character set for your localization set. What does
"ANSI" map to in European and Asian versions of Windows?

2. On my English Mac, the default character set for textedit is "Mac OS
Roman". What is it for foreign language macs? What API does an application
use to query this default character set? What setting is it derived from?
The Unix-level locale (seems not!) or some GUI-level setting (which one)?

3. In general, how do modern versions of Linux and other Unix handle this
issue? In particular: what is your default encoding and how did your
operating system determine it? Did you install a locale-specific version?
Did the installer ask you? Did you edit a configuration file? Did you change
a GUI setting? What is the relationship between your localization of
Gnome/KDE and your default encoding?

 Paul Prescod

From solipsis at pitrou.net  Thu Sep  7 22:13:56 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Thu, 07 Sep 2006 22:13:56 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: <1157660036.8533.18.camel@fsol>


Hi,

On Thursday 7 September 2006 at 12:21 -0700, Paul Prescod wrote:
> If you have access to "German Windows XP", "Japanese Windows XP",
> "Spanish OS X",  "Japanese OS X", "German Ubuntu" etc., I would
> appreciate answers to the following questions. 

French Mandriva (up-to-date development version).

> In particular: what is your default encoding and how did your
> operating system determine it?

My locale is named "fr_FR" and the encoding is iso-8859-15.

> Did you install a locale-specific version? Did the installer ask you?

No, it's the built-in config. I don't remember the installer asking me
anything except the language and keyboard layout.

> What is the relationship between your localization of Gnome/KDE and
> your default encoding? 

Ok, I hexdump'ed a few .mo files (the gettext-compatible files which
contain translation strings) and the result is a bit funny:
Gnome/KDE .mo files use utf-8, while .mo files for various command-line
tools (e.g. aspell) use iso-8859-15.

Also, it is interesting to know that Gnome tools like gedit (the Gnome
text editor) normally default to utf-8; however, gedit was patched by
Mandriva to use the system encoding by default (which breaks character
set auto-detection because the Mandriva patch is awful:
http://qa.mandriva.com/show_bug.cgi?id=20277).


By the way, you should be aware that filesystems have their own
encodings which can differ from the default system encoding
(depending on how it's declared in /etc/fstab). I don't know of a simple
way to retrieve the encoding for a given directory (except trying to
find out the filesystem mounting point and parsing /etc/fstab...
*sigh*). This can be annoying when handling non-ASCII filenames.

Regards

Antoine.



From qrczak at knm.org.pl  Thu Sep  7 23:22:31 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 07 Sep 2006 23:22:31 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <1157660036.8533.18.camel@fsol> (Antoine Pitrou's message of
	"Thu, 07 Sep 2006 22:13:56 +0200")
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<1157660036.8533.18.camel@fsol>
Message-ID: <8764fzgync.fsf@qrnik.zagroda>

Antoine Pitrou <solipsis at pitrou.net> writes:

> By the way, you should be aware that filesystems have their own
> encodings which can differ from the default system encoding
> (depending on how it's declared in /etc/fstab). I don't know of a
> simple way to retrieve the encoding for a given directory (except
> trying to find out the filesystem mounting point and parsing
> /etc/fstab... *sigh*). This can be annoying when handling non-ASCII
> filenames.

I believe the intent is to set up all filesystems to use the same
encoding externally. The encoding setting exists only for some
filesystems, especially those which use UTF-16 internally, where
it would be impossible to physically store filenames in the default
system encoding, or where the filesystem is likely to be created
on a different system with a different encoding.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Fri Sep  8 00:41:30 2006
From: paul at prescod.net (Paul Prescod)
Date: Thu, 7 Sep 2006 15:41:30 -0700
Subject: [Python-3000] Help on text editors
In-Reply-To: <1157660036.8533.18.camel@fsol>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<1157660036.8533.18.camel@fsol>
Message-ID: <1cb725390609071541p308293b0u29d264f619d23d92@mail.gmail.com>

Are you plugged into the Mandriva community? Is there any debate about the
continued use of iso8859-15? Obviously it has the benefit of backwards
compatibility and slightly smaller file sizes. But it also has very severe
limitations and interoperability problems as you describe below.

On 9/7/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
>
> Hi,
>
> On Thursday 7 September 2006 at 12:21 -0700, Paul Prescod wrote:
> > If you have access to "German Windows XP", "Japanese Windows XP",
> > "Spanish OS X",  "Japanese OS X", "German Ubuntu" etc., I would
> > appreciate answers to the following questions.
>
> French Mandriva (up-to-date development version).
>
> > In particular: what is your default encoding and how did your
> > operating system determine it?
>
> My locale is named "fr_FR" and the encoding is iso-8859-15.
>
> > Did you install a locale-specific version? Did the installer ask you?
>
> No, it's the built-in config. I don't remember the installer asking me
> anything except the language and keyboard layout.
>
> > What is the relationship between your localization of Gnome/KDE and
> > your default encoding?
>
> Ok, I hexdump'ed a few .mo files (the gettext-compatible files which
> contain translation strings) and the result is a bit funny:
> Gnome/KDE .mo files use utf-8, while .mo files for various command-line
> tools (e.g. aspell) use iso-8859-15.
>
> Also, it is interesting to know that Gnome tools like gedit (the Gnome
> text editor) normally default to utf-8, however gedit was patched by
> Mandriva to use the system encoding by default (which breaks character
> set auto-detection because the Mandriva patch is awful :
> http://qa.mandriva.com/show_bug.cgi?id=20277).
>
>
> By the way, you should be aware that filesystems have their own
> encodings which can differ from the default system encoding
> (depending on how it's declared in /etc/fstab). I don't know of a simple
> way to retrieve the encoding for a given directory (except trying to
> find out the filesystem mounting point and parsing /etc/fstab...
> *sigh*). This can be annoying when handling non-ASCII filenames.
>
> Regards
>
> Antoine.
>
>
>

From guido at python.org  Fri Sep  8 01:33:43 2006
From: guido at python.org (Guido van Rossum)
Date: Thu, 7 Sep 2006 16:33:43 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
Message-ID: <ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>

On 9/7/06, tomer filiba <tomerfiliba at gmail.com> wrote:
> lots of things have been discussed, lots of new ideas came:
> it's time to rethink the design of iostack; i'll try to see into it.
>
> there are several key issues:
> * splitting streams to separate reading and writing sides.
> * the underlying OS resource can be separated into some very low
> level abstraction layer, over which streams would operate.
> * stateful-seek-cookies sound like the perfect solution
>
> issues with seeking:
> being opaque, there's no sense in having the long debated
> position property (although i really liked it :)). i.e., there's no sense
> in doing s.position += some_opaque_cookie
>
> on the other hand, since streams are byte-oriented, over which the
> data abstraction layer (text, etc.) is placed, maybe there's sense in
> splitting these into two distinct APIs:
>
> * tell()/seek() for the byte-level stream position: a stream is just a
> sequence of bytes in which you can seek.
> * data-abstraction-layer "pointers": pointers will be stateful stream
> locations of encoded *objects*.
>
> > you will not be able to "forge" pointers: you'll first have to come across
> > a valid object location, and only then can you get a "pointer" pointing to it.
> of course these pointers should be kept cheap, and for most situations,
> plain integers would suffice.

Using plain ints makes them trivially forgeable though. Not sure I
mind, just noticing.

> example:
>
> f = TextAdapter(BufferingLayer(FileStream(...)), encoding = "utf-32")
> f.write("hello world")
> p = f.get_pointer()
> f.write("wide web")
> f.set_pointer(p)

Why not use tell() and seek() instead of get_pointer() and
set_pointer()? Seek should also support several special cases:
f.seek(0) seeks to the start of the file no matter what type is
otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
no-op, f.seek(0, 2) seeks to EOF.

> or using a property:
> p = f.pointer
> f.pointer = p

Since the creation of a seek cookie may be relatively expensive (since
it may have to ask the decoder a rather personal question :-) it
should be a method, not a property.

> something like that... though i would like to receive comments on
> that first, before i go into deeper meditation :)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From hasan.diwan at gmail.com  Fri Sep  8 02:41:30 2006
From: hasan.diwan at gmail.com (Hasan Diwan)
Date: Thu, 7 Sep 2006 17:41:30 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
Message-ID: <2cda2fc90609071741t1b7fc4a9gff48836d187367da@mail.gmail.com>

I was thinking about the new IOStack and could not come up with a use case
requiring both line-oriented and record-oriented read/write
functionality -- the general case is record-oriented; lines are just
newline-terminated records. Perhaps this has already been dropped, but I
seem to recall the original spec having a readrec/writerec?  Similarly,
readline/writeline aren't needed. For example...

import sys

class Stream(object):
    def read(self):
        raise Exception('cannot read')

    def readrec(self, terminator):
        # Read one byte at a time until the record ends with the
        # terminator (or EOF is reached).
        ret = ''
        while not ret.endswith(terminator):
            ch = self.read()
            if not ch:  # EOF
                break
            ret = ret + ch
        return ret

    def write(self, data):
        raise Exception('cannot write')

    def writerec(self, data, terminator):
        # Write one terminator-delimited record.
        self.write(data + terminator)

class InputStream(Stream):
    def read(self):  # reads 1 byte
        return sys.stdin.read(1)

    def readline(self):
        return self.readrec('\n')  # or whatever constant represents the EOL
-- 
Cheers,
Hasan Diwan <hasan.diwan at gmail.com>

From murman at gmail.com  Fri Sep  8 03:05:09 2006
From: murman at gmail.com (Michael Urman)
Date: Thu, 7 Sep 2006 20:05:09 -0500
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: <dcbbbb410609071805m7f902564xd6547b19ddec3528@mail.gmail.com>

On 9/7/06, Paul Prescod <paul at prescod.net> wrote:
> 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> What does "ANSI" map to in European and Asian versions of Windows?

On most Western European configurations, the ANSI Code Page is
historically 1252 (CP1252 or WINDOWS-1252 according to iconv). It may
be something different now for supporting the EURO symbol. Japanese
machines tend to use CP932 (or MS932), also known as SHIFT-JIS (or
close enough). I don't know exactly which ACPs match other languages
off the top of my head.

I expect notepad will default to the ACP encoding whenever a file is
detected as such, or a new file contains only characters representable
via that code page. Otherwise I expect it will default to "Unicode"
(UTF-16 / UCS-2). When editing an existing file, it will default to
the detected encoding, unless "Unicode" is required to save the
changes. It uses BOMs to mark all unicode encodings, but doesn't
require them to be present in order to detect "Unicode."
http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx

> 3. In general, how do modern versions of Linux and other Unix handle this
> issue?

I use en-US.UTF-8, after many years of C or en-US.ISO-8859-1. Due to
the age of my install, this was not the default, but now I use it as
pervasively as possible. I set it via GDM these days, but via my shell
rc file originally.

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

From david.nospam.hopwood at blueyonder.co.uk  Fri Sep  8 04:03:55 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 08 Sep 2006 03:03:55 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: <4500CF8B.6040003@blueyonder.co.uk>

Paul Prescod wrote:
> Guido has asked me to do some research in aid of a file encoding
> detection/defaulting PEP.
> 
> I only have access to a small number of operating systems and language
> variants so I need help.
> 
> If you have access to "German Windows XP", "Japanese Windows XP",

Since Win2K there is actually no such thing, from a technical point of view --
just Win2K or WinXP with a German or Japanese "language group" installed,
and a corresponding locale selected as the interface locale for a given user
account. The links below should make this clearer.

> "Spanish OS X",  "Japanese OS X", "German Ubuntu" etc., I would appreciate
> answers to the following questions.
> 
> 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> "ANSI" is not a real encoding at all (and certainly not one from the
> American National Standards Institute -- they should sue!). ANSI is just
> the default Windows character set for your localization set. What does
> "ANSI" map to in European and Asian versions of Windows?

See <http://www.microsoft.com/globaldev/DrIntl/faqs/Locales.mspx>,
<http://www.microsoft.com/globaldev/reference/WinCP.mspx>, and
<http://www.microsoft.com/globaldev/reference/win2k/setup/localsupport.mspx>.

Each "language group" maps to a similarly named "ANSI" code page (and also
an "OEM" code page) in the obvious way.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Fri Sep  8 04:12:27 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 08 Sep 2006 03:12:27 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <4500CF8B.6040003@blueyonder.co.uk>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<4500CF8B.6040003@blueyonder.co.uk>
Message-ID: <4500D18B.1040404@blueyonder.co.uk>

David Hopwood wrote:
> Paul Prescod wrote:
> 
>>Guido has asked me to do some research in aid of a file encoding
>>detection/defaulting PEP.
>>
>>I only have access to a small number of operating systems and language
>>variants so I need help.
>>
>>If you have access to "German Windows XP", "Japanese Windows XP",
> 
> Since Win2K there is actually no such thing, from a technical point of view --
> just Win2K or WinXP with a German or Japanese "language group" installed,

This is right...

> and a corresponding locale selected as the interface locale for a given user account.

Correction: the "System Locale" is what determines the ANSI and OEM codepages,
and this is *not* dependent on the user account. Changing it requires a reboot,
so you can assume that it stays constant for the lifetime of a Python process.

> The links below should make this clearer.

I obviously should have read them more thoroughly myself! :-(

> See <http://www.microsoft.com/globaldev/DrIntl/faqs/Locales.mspx>,
> <http://www.microsoft.com/globaldev/reference/WinCP.mspx>, and
> <http://www.microsoft.com/globaldev/reference/win2k/setup/localsupport.mspx>.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Fri Sep  8 04:46:40 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 08 Sep 2006 03:46:40 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <dcbbbb410609071805m7f902564xd6547b19ddec3528@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<dcbbbb410609071805m7f902564xd6547b19ddec3528@mail.gmail.com>
Message-ID: <4500D990.3000707@blueyonder.co.uk>

Michael Urman wrote:
> On 9/7/06, Paul Prescod <paul at prescod.net> wrote:
> 
>>1. On US English Windows, Notepad defaults to an encoding called "ANSI".
>>What does "ANSI" map to in European and Asian versions of Windows?
> 
> On most Western European configurations, the ANSI Code Page is
> historically 1252 (CP1252 or WINDOWS-1252 according to iconv). It may
> be something different now for supporting the EURO symbol.

None of the Windows-125x code page numbers changed when '€' was added. These
are "open" encodings in the Unicode and ISO terminology; i.e. there is an
authority (Microsoft) who can assign any previously unassigned code point at
any time.

> Japanese machines tend to use CP932 (or MS932), also known as SHIFT-JIS (or
> close enough).

Not close enough, actually. Cp932 is a superset of US-ASCII, whereas Shift-JIS
isn't: 0x5C represents '\' and '¥' respectively. If you think about how
important '\' is as an escaping metacharacter, this is quite a big deal
(there are other differences, but they are less important). Actual practice
in Japan is that 0x5C *can* be used as an escaping metacharacter with the
semantics of '\' (even if it is sometimes displayed as '¥'), and so Cp932 is
the encoding that should be used, even on non-Microsoft OSes.

> I expect notepad will default to the ACP encoding whenever a file is
> detected as such, or a new file contains only characters representable
> via that code page. Otherwise I expect it will default to "Unicode"
> (UTF-16 / UCS-2). When editing an existing file, it will default to
> the detected encoding, unless "Unicode" is required to save the
> changes. It uses BOMs to mark all unicode encodings, but doesn't
> require them to be present in order to detect "Unicode."
> http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx

Yes. However, this is not a good idea for precisely the reason described
on that page (false detection of Unicode), and so any Unicode detection
algorithm in Python should only be based on detecting a BOM, IMHO.
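
A BOM-only detector along those lines is straightforward (a sketch; the
UTF-32 BOMs must be tested before the UTF-16 ones, since a UTF-32-LE BOM
begins with the UTF-16-LE one):

import codecs

def detect_bom(data):
    # Return the codec name indicated by a leading BOM, or None.
    for bom, name in [(codecs.BOM_UTF8, 'utf-8'),
                      (codecs.BOM_UTF32_LE, 'utf-32-le'),
                      (codecs.BOM_UTF32_BE, 'utf-32-be'),
                      (codecs.BOM_UTF16_LE, 'utf-16-le'),
                      (codecs.BOM_UTF16_BE, 'utf-16-be')]:
        if data.startswith(bom):
            return name
    return None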

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>





From jeff at soft.fujitsu.com  Fri Sep  8 05:09:26 2006
From: jeff at soft.fujitsu.com (Jeff Wilcox)
Date: Fri, 8 Sep 2006 12:09:26 +0900
Subject: [Python-3000] Help on text editors
In-Reply-To: <mailman.528.1157676092.5278.python-3000@python.org>
Message-ID: <LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>

> From: "Paul Prescod" <paul at prescod.net>
> 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> "ANSI" is not a real encoding at all (and certainly not one from the
On Japanese Windows 2000, Notepad defaults to ANSI as it does in the English
version.  It actually writes Shift JIS though.

> 2. On my English Mac, the default character set for textedit is "Mac OS
> Roman". What is it for foreign language macs? What API does an application
> use to query this default character set? What setting is it derived from?
> The Unix-level locale (seems not!) or some GUI-level setting (which one)?

Mac OS X actually doesn't have different language versions of the operating
system.  If you change the language setting, the Japanese version *becomes*
the English version and vice versa. (Several of the English speakers that I
work with have purchased Japanese Macs and switched them over to English,
they're indistinguishable from English Macs afterwards.  Similarly, several
Macs purchased in the US have been successfully switched to Japanese, and
become indistinguishable from Macs bought in Japan.)

> 3. In general, how do modern versions of Linux and other Unix handle this
> issue? In particular: what is your default encoding and how did your

On Vine Linux (popular in Japan), the default text encoding is EUC with no
configuration changes.



From solipsis at pitrou.net  Fri Sep  8 09:02:48 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 08 Sep 2006 09:02:48 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
Message-ID: <1157698968.4636.3.camel@fsol>

On Thursday 7 September 2006 at 16:33 -0700, Guido van Rossum wrote:
> Why not use tell() and seek() instead of get_pointer() and
> set_pointer()? Seek should also support several special cases:
> f.seek(0) seeks to the start of the file no matter what type is
> otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
> no-op, f.seek(0, 2) seeks to EOF.

Perhaps it would be good to drop those magic numbers (0, 1, 2) for
seek()? They don't really help readability except perhaps for people
who still do a lot of C ;)

Regards

Antoine.



From solipsis at pitrou.net  Fri Sep  8 09:08:41 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 08 Sep 2006 09:08:41 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071541p308293b0u29d264f619d23d92@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<1157660036.8533.18.camel@fsol>
	<1cb725390609071541p308293b0u29d264f619d23d92@mail.gmail.com>
Message-ID: <1157699321.4636.9.camel@fsol>


On Thursday 7 September 2006 at 15:41 -0700, Paul Prescod wrote:
> Are you plugged into the Mandriva community?

Not much. I only participate in bug reports ;)

> Is there any debate about the continued use of iso8859-15?

I think there has been some for years. Some people in the community push
for UTF-8, but I guess the problem is related to Mandriva company
management or priority-setting policies.

Regards

Antoine.



From hasan.diwan at gmail.com  Fri Sep  8 09:26:55 2006
From: hasan.diwan at gmail.com (Hasan Diwan)
Date: Fri, 8 Sep 2006 00:26:55 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1157698968.4636.3.camel@fsol>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
	<1157698968.4636.3.camel@fsol>
Message-ID: <2cda2fc90609080026s7184815fh19345ba764d03c90@mail.gmail.com>

On 08/09/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
> Perhaps it would be good to drop those magic numbers (0, 1, 2) for
> seek()? They don't really help readability except perhaps for people
> who still do a lot of C ;)
>

+1
If we can't or don't want to eliminate the "magic numbers" entirely, perhaps we
could assign symbolic constants to them? fileobj.seek(fileobj.START), for
instance?
-- 
Cheers,
Hasan Diwan <hasan.diwan at gmail.com>

From tomerfiliba at gmail.com  Fri Sep  8 10:53:33 2006
From: tomerfiliba at gmail.com (tomer filiba)
Date: Fri, 8 Sep 2006 10:53:33 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
Message-ID: <1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com>

> Why not use tell() and seek() instead of get_pointer() and
> set_pointer()?

because, at least the way i see it, seek and tell are byte-oriented,
while the upper layers of the stack may be object-oriented
(including, for instance, characters, struct records, or pickled objects),
so pointers would be a vector of (byte-position, stateful object-layer info).

pointers are different from mere byte-positions, so i thought streams
should have a byte-level API, while the upper layers are more likely
to work with "pointers".
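
for illustration, such a pointer could be as simple as this sketch
(the names are made up, not a real API):

class Pointer(object):
    # an opaque cookie pairing a byte offset with whatever state
    # the object layer (decoder, unpickler, ...) needs to resume there
    def __init__(self, byte_pos, layer_state):
        self.byte_pos = byte_pos
        self.layer_state = layer_state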

[Guido]
> Seek should also support several special cases:
> f.seek(0) seeks to the start of the file no matter what type is
> otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
> no-op, f.seek(0, 2) seeks to EOF.

[Antoine]
> Perhaps it would be good to drop those magic numbers (0, 1, 2) for
> seek()? They don't really help readability except perhaps for people
> who still do a lot of C ;)

yes, this was discussed some time ago. we concluded that the new
position property should behave similarly to negative indexes:

f.position = 5  --  absolute seek, from the beginning of the stream
f.position += 3  --  relative seek (*)
f.position = -1  --  absolute seek, back from the end (**)

(*) it requires two syscalls, so we'll also have a seekby() method
(**) like "hello"[-1]

imho it's much simpler and more intuitive than these magic consts,
and it feels more like object indexing.


-tomer

On 9/8/06, Guido van Rossum <guido at python.org> wrote:
> On 9/7/06, tomer filiba <tomerfiliba at gmail.com> wrote:
> > lots of things have been discussed, lots of new ideas came:
> > it's time to rethink the design of iostack; i'll try to see into it.
> >
> > there are several key issues:
> > * splitting streams to separate reading and writing sides.
> > * the underlying OS resource can be separated into some very low
> > level abstraction layer, over which streams would operate.
> > * stateful-seek-cookies sound like the perfect solution
> >
> > issues with seeking:
> > being opaque, there's no sense in having the long debated
> > position property (although i really liked it :)). i.e., there's no sense
> > in doing s.position += some_opaque_cookie
> >
> > on the other hand, since streams are byte-oriented, over which the
> > data abstraction layer (text, etc.) is placed, maybe there's sense in
> > splitting these into two distinct APIs:
> >
> > * tell()/seek() for the byte-level stream position: a stream is just a
> > sequence of bytes in which you can seek.
> > * data-abstraction-layer "pointers": pointers will be stateful stream
> > locations of encoded *objects*.
> >
> > you will not be able to "forge" pointers: you'll first have to come across
> > a valid object location, and only then can you get a "pointer" pointing to it.
> > of course these pointers should be kept cheap, and for most situations,
> > plain integers would suffice.
>
> Using plain ints makes them trivially forgeable though. Not sure I
> mind, just noticing.
>
> > example:
> >
> > f = TextAdapter(BufferingLayer(FileStream(...)), encoding = "utf-32")
> > f.write("hello world")
> > p = f.get_pointer()
> > f.write("wide web")
> > f.set_pointer(p)
>
> Why not use tell() and seek() instead of get_pointer() and
> set_pointer()? Seek should also support several special cases:
> f.seek(0) seeks to the start of the file no matter what type is
> otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
> no-op, f.seek(0, 2) seeks to EOF.
>
> > or using a property:
> > p = f.pointer
> > f.pointer = p
>
> Since the creation of a seek cookie may be relatively expensive (since
> it may have to ask the decoder a rather personal question :-) it
> should be a method, not a property.
>
> > something like that... though i would like to receive comments on
> > that first, before i go into deeper meditation :)
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>

From qrczak at knm.org.pl  Fri Sep  8 11:17:36 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 08 Sep 2006 11:17:36 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com>
	(tomer filiba's message of "Fri, 8 Sep 2006 10:53:33 +0200")
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
	<1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com>
Message-ID: <87zmda1zv3.fsf@qrnik.zagroda>

"tomer filiba" <tomerfiliba at gmail.com> writes:

> yes, this was discussed some time ago. we concluded that the new
> position property should behave similarly to negative indexes:
>
> f.position = 5  --  absolute seek, from the beginning of the stream
> f.position += 3  --  relative seek (*)
> f.position = -1  --  absolute seek, back from the end (**)

Seeking to the very end requires a special constant, otherwise
it's off by 1.

I don't understand such a strong desire to push that syntax despite
its problems with implementing += in one syscall and with specifying
the end point. If it doesn't work well, don't do it that way.

Of course magic constants are bad. My language Kogut has three
separate functions for seeking; it's simpler than interpreting
non-negative and negative numbers differently (and it can even seek
past the end if the OS supports that). I can't imagine a case where
the origin of seeking is not known statically.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From ncoghlan at gmail.com  Fri Sep  8 12:31:48 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 08 Sep 2006 20:31:48 +1000
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
	<1d85506f0609080153x685c5d5fga6f3352830fa394b@mail.gmail.com>
Message-ID: <45014694.2030609@gmail.com>

[Guido]
>> Why not use tell() and seek() instead of get_pointer() and
>> set_pointer()?

[tomer]
> because, at least the way i see it, seek and tell are byte-oriented,
> while the upper layers of the stack may be object-oriented
> (including, for instance, characters, struct records, or pickled objects),
> so pointers would be a vector of (byte-position, stateful object-layer info).
> 
> pointers are different from mere byte-positions, so i thought streams
> should have a byte-level API, while the upper layers are more likely
> to work with "pointers".

seek() & tell() aren't necessarily byte-oriented, and a program can get itself 
in trouble by treating them as if they are. Seeking to an arbitrary byte 
position on a Windows text file can be a very bad idea :)

So -1 on using different names, but +1 on permitting different IO layers to 
assign a different meaning to exactly what it is that seek() and tell() are 
indexing.

With the IO layer doing a translation, I suggest that the seek/tell cookies 
should be plain integers, so that doing f.seek(20) on a text file will seek to 
the 20th character instead of the 20th byte. This approach is backwards 
compatible with the current rule of 'for text files, arguments to seek() must 
be previously returned from tell()' and Guido's desire that f.seek(0) always 
seek to the beginning of the file.
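
A slow-but-correct sketch of that translation (illustrative only, not a
worked-out design; seeking simply re-decodes from the start, so it works
even for stateful codecs):

import codecs

class TextLayer(object):
    def __init__(self, fileobj, encoding):
        self._file = fileobj
        self._encoding = encoding
        self._reader = codecs.getreader(encoding)(fileobj)
        self._pos = 0                      # position in characters

    def read(self, n=-1):
        text = self._reader.read(chars=n)
        self._pos += len(text)
        return text

    def tell(self):
        return self._pos                   # a plain integer cookie

    def seek(self, n):
        self._file.seek(0)                 # rewind the byte stream
        self._reader = codecs.getreader(self._encoding)(self._file)
        self._reader.read(chars=n)         # skip the first n characters
        self._pos = n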

> 
> [Guido]
>> Seek should also support several special cases:
>> f.seek(0) seeks to the start of the file no matter what type is
>> otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
>> no-op, f.seek(0, 2) seeks to EOF.
> 
> [Antoine]
>> Perhaps it would be good to drop those magic numbers (0, 1, 2) for
>> seek() ? They don't really help readibility except perhaps for people
>> who still do a lot of C ;)

Since I've been playing with string methods lately, I believe a natural name 
for the 'seek from the end' version is f.rseek(0). And someone else suggested 
f.seekby(0) as a reasonable name for relative seeking.

f.seek(0)   # Go to beginning
f.seekby(0) # Stay at current position
f.rseek(0)  # Go to end

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From exarkun at divmod.com  Fri Sep  8 14:29:57 2006
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Fri, 8 Sep 2006 08:29:57 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <2cda2fc90609080026s7184815fh19345ba764d03c90@mail.gmail.com>
Message-ID: <20060908122957.1717.1038742216.divmod.quotient.42846@ohm>

On Fri, 8 Sep 2006 00:26:55 -0700, Hasan Diwan <hasan.diwan at gmail.com> wrote:
>On 08/09/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>>
>>Perhaps it would be good to drop those magic numbers (0, 1, 2) for
>>seek() ? They don't really help readability except perhaps for people
>>who still do a lot of C ;)
>
>+1
>If we can't or don't want to eliminate the "magic numbers" entirely, perhaps we
>could assign symbolic constants to them? fileobj.seek(fileobj.START) for
>instance?

Note that Python is _worse_ than C here.  C has named constants for these,
Python does not expose them.

Jean-Paul

From ronaldoussoren at mac.com  Fri Sep  8 15:37:00 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Fri, 08 Sep 2006 15:37:00 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <20060908122957.1717.1038742216.divmod.quotient.42846@ohm>
References: <20060908122957.1717.1038742216.divmod.quotient.42846@ohm>
Message-ID: <10397427.1157722620731.JavaMail.ronaldoussoren@mac.com>

 
On Friday, September 08, 2006, at 02:30PM, Jean-Paul Calderone <exarkun at divmod.com> wrote:

>On Fri, 8 Sep 2006 00:26:55 -0700, Hasan Diwan <hasan.diwan at gmail.com> wrote:
>>On 08/09/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>>>
>>>Perhaps it would be good to drop those magic numbers (0, 1, 2) for
>>>seek() ? They don't really help readability except perhaps for people
>>>who still do a lot of C ;)
>>
>>+1
>>If we can't or don't want to eliminate the "magic numbers" entirely, perhaps we
>>could assign symbolic constants to them? fileobj.seek(fileobj.START) for
>>instance?
>
>Note that Python is _worse_ than C here.  C has named constants for these,
>Python does not expose them.

What about os.SEEK_SET, os.SEEK_CUR, os.SEEK_END? The named constants are there, just not at the most convenient location.
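
For example (assuming a file 'data.bin' that exists and is at least ten
bytes long):

    import os

    f = open('data.bin', 'rb')
    f.seek(0, os.SEEK_END)     # same as f.seek(0, 2): jump to EOF
    size = f.tell()
    f.seek(-10, os.SEEK_END)   # ten bytes before the end
    f.seek(5, os.SEEK_CUR)     # same as f.seek(5, 1): relative seek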

Ronald
>
>Jean-Paul

From exarkun at divmod.com  Fri Sep  8 15:40:42 2006
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Fri, 8 Sep 2006 09:40:42 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <10397427.1157722620731.JavaMail.ronaldoussoren@mac.com>
Message-ID: <20060908134042.1717.1143631052.divmod.quotient.42896@ohm>

On Fri, 08 Sep 2006 15:37:00 +0200, Ronald Oussoren <ronaldoussoren at mac.com> wrote:
>
>On Friday, September 08, 2006, at 02:30PM, Jean-Paul Calderone <exarkun at divmod.com> wrote:
>>
>>Note that Python is _worse_ than C here.  C has named constants for these,
>>Python does not expose them.
>
>What about os.SEEK_SET, os.SEEK_CUR, os.SEEK_END? The named constants are there, just not at the most convenient location.

New in Python 2.5, so Python will finally be caught up with C when 2.5
final is released :)

Jean-Paul

From ronaldoussoren at mac.com  Fri Sep  8 16:06:30 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Fri, 08 Sep 2006 16:06:30 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <20060908134042.1717.1143631052.divmod.quotient.42896@ohm>
References: <20060908134042.1717.1143631052.divmod.quotient.42896@ohm>
Message-ID: <5355804.1157724390887.JavaMail.ronaldoussoren@mac.com>

 
On Friday, September 08, 2006, at 03:41PM, Jean-Paul Calderone <exarkun at divmod.com> wrote:

>On Fri, 08 Sep 2006 15:37:00 +0200, Ronald Oussoren <ronaldoussoren at mac.com> wrote:
>>
>>On Friday, September 08, 2006, at 02:30PM, Jean-Paul Calderone <exarkun at divmod.com> wrote:
>>>
>>>Note that Python is _worse_ than C here.  C has named constants for these,
>>>Python does not expose them.
>>
>>What about os.SEEK_SET, os.SEEK_CUR, os.SEEK_END? The named constants are there, just not at the most convenient location.
>
>New in Python 2.5, so Python will finally be caught up with C when 2.5
>final is released :)

The same constants are also defined in posixfile, which the Python 2.3 documentation already calls deprecated. Sigh...

Ronald


From guido at python.org  Fri Sep  8 18:37:13 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 8 Sep 2006 09:37:13 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1157698968.4636.3.camel@fsol>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>
	<1157698968.4636.3.camel@fsol>
Message-ID: <ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>

On 9/8/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Thursday 07 September 2006 at 16:33 -0700, Guido van Rossum wrote:
> > Why not use tell() and seek() instead of get_pointer() and
> > set_pointer()? Seek should also support several special cases:
> > f.seek(0) seeks to the start of the file no matter what type is
> > otherwise used for pointers ("seek cookies" ?), f.seek(0, 1) is a
> > no-op, f.seek(0, 2) seeks to EOF.
>
> Perhaps it would be good to drop those magic numbers (0, 1, 2) for
> seek() ? They don't really help readability except perhaps for people
> who still do a lot of C ;)

Maybe (since I fall in that category it doesn't bother me :-), but we
shouldn't replace them with symbolic constants. Having to import
another module to import names like SEEK_CUR and SEEK_END is not
Pythonic. Perhaps the seek() method can grow keyword arguments to
indicate the different types of seekage, or there should be three
separate methods.
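
E.g. something like this rough, untested sketch wrapping today's file
objects (the keyword names are made up on the spot):

    import os

    class KwSeekFile(object):
        def __init__(self, f):
            self._f = f
        def seek(self, pos=0, relative=False, from_end=False):
            if relative:
                self._f.seek(pos, os.SEEK_CUR)
            elif from_end:
                self._f.seek(pos, os.SEEK_END)
            else:
                self._f.seek(pos, os.SEEK_SET)

That way f.seek(0) keeps its current meaning, while
f.seek(0, from_end=True) replaces f.seek(0, 2).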

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From mcherm at mcherm.com  Fri Sep  8 18:45:50 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Fri, 08 Sep 2006 09:45:50 -0700
Subject: [Python-3000] The future of exceptions
Message-ID: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>

Marcin Kowalczyk writes:
> In my language the traceback is materialized from the stack only
> if needed [...] The stack is not
> physically unwound until an exception handler completes successfully,
> so the data is available until then.

Jim Jewett writes:
> Even today, if a StopIteration() participates in a reference cycle,
> then it won't be reclaimed until the next gc run.  I'm not quite sure
> which direction should be a weakref, but I think it would be
> reasonable for the cycle to get broken when an catching except block
> exits without reraising.

When thinking about these things, don't forget that in Python an
exception handler can perform complicated actions, including invoking
new functions and possibly raising new exceptions. Any solution should
allow the following code to work "properly":

   # -- WARNING: demo code, not tested

   def logError(msg):
       try:
           errorChannel.write(msg)
       except IOError:
           pass

   try:
       callSomeCode()
   except SomeException as err:
       msg = str(err)
       logError(msg)
       raise msg


By "properly" I mean that that when callSomeCode() raises SomeException
the uncaught exception will cause the program should print a stacktrace
which should correcly show the stack frame of callSomeCode(). This
should happen regardless of whether errorChannel raised an IOError.
In the process, though, we (1) added new frames to the stack, and (2)
successfully exited an error handler (the one for IOError).

It is work to provide this feature, but without it Python programmers
cannot freely use any code they like within exception handlers, which
I think is an important feature. It doesn't necessarily imply that
the traceback be materialized immediately upon exception creation
(which is undesirable because we want exceptions lightweight enough
to use for things like for loop control!)... but it might mean that
pieces of the stack frame need to hang around as long as the exception
itself does.

-- Michael Chermside


From ncoghlan at gmail.com  Fri Sep  8 19:00:33 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 09 Sep 2006 03:00:33 +1000
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>	<1157698968.4636.3.camel@fsol>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
Message-ID: <4501A1B1.5050707@gmail.com>

Guido van Rossum wrote:
> Maybe (since I fall in that category it doesn't bother me :-), but we
> shouldn't replace them with symbolic constants. Having to import
> another module to import names like SEEK_CUR and SEEK_END is not
> Pythonic. Perhaps the seek() method can grow keyword arguments to
> indicate the different types of seekage, or there should be three
> separate methods.

As I mentioned in a different part of the thread, I believe seek(), seekby() 
and rseek() would work as names for the 3 different method approach.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From talin at acm.org  Fri Sep  8 19:12:13 2006
From: talin at acm.org (Talin)
Date: Fri, 08 Sep 2006 10:12:13 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <4501A1B1.5050707@gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>	<ca471dc20609071633p60a0f7c2m2186dba1e30e68a7@mail.gmail.com>	<1157698968.4636.3.camel@fsol>	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com>
Message-ID: <4501A46D.20007@acm.org>

Nick Coghlan wrote:
> Guido van Rossum wrote:
>> Maybe (since I fall in that category it doesn't bother me :-), but we
>> shouldn't replace them with symbolic constants. Having to import
>> another module to import names like SEEK_CUR and SEEK_END is not
>> Pythonic. Perhaps the seek() method can grow keyword arguments to
>> indicate the different types of seekage, or there should be three
>> separate methods.
> 
> As I mentioned in a different part of the thread, I believe seek(), seekby() 
> and rseek() would work as names for the 3 different method approach.
> 
> Cheers,
> Nick.
> 

One advantage of that approach is that layers which don't support a
particular operation could omit one or more of those functions, or have
differently-named functions that represent what the layer is capable of.
For example, if a layer is only capable of seeking forward, you could
use 'skip' like the Java stream does; if a layer can rewind the stream
back to zero, but not to any intermediate position, you could have a
'reset' method.

By taking this approach, you can come up with an API for a given layer
that fits naturally into the behavior model of that layer, without
trying to cram it into a generic model for seeking that attempts to
cover all cases. For text streams, come up with a model that makes sense
for what kinds of things you want to do with text, and don't try to
make it look like the API for the underlying byte stream.
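
A minimal sketch of what such layer-specific navigation might look like
(all the names here are illustrative only):

    class ForwardOnlyLayer(object):
        def __init__(self, stream):
            self._stream = stream
        def skip(self, n):
            self._stream.read(n)    # discard n units, like Java's skip()

    class RewindableLayer(ForwardOnlyLayer):
        def reset(self):
            self._stream.seek(0)    # back to the start, but nowhere else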

-- Talin

From aahz at pythoncraft.com  Fri Sep  8 19:21:51 2006
From: aahz at pythoncraft.com (Aahz)
Date: Fri, 8 Sep 2006 10:21:51 -0700
Subject: [Python-3000] The future of exceptions
In-Reply-To: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
Message-ID: <20060908172151.GA9911@panix.com>

On Fri, Sep 08, 2006, Michael Chermside wrote:
>
>    def logError(msg):
>        try:
>            errorChannel.write(msg)
>        except IOError:
>            pass
> 
>    try:
>        callSomeCode()
>    except SomeException as err:
>        msg = str(err)
>        logError(msg)
>        raise msg

This code is guaranteed to fail in Python 3.0, of course, because string
exceptions aren't allowed.  But your point is taken, I think.
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

"LL YR VWL R BLNG T S"  -- www.nancybuttons.com

From fdrake at acm.org  Fri Sep  8 20:03:17 2006
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 8 Sep 2006 14:03:17 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <4501A1B1.5050707@gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com>
Message-ID: <200609081403.18350.fdrake@acm.org>

On Friday 08 September 2006 13:00, Nick Coghlan wrote:
 > As I mentioned in a different part of the thread, I believe seek(),
 > seekby() and rseek() would work as names for the 3 different method
 > approach.

+1, for the reasons discussed.


  -Fred

-- 
Fred L. Drake, Jr.   <fdrake at acm.org>

From guido at python.org  Fri Sep  8 20:06:41 2006
From: guido at python.org (Guido van Rossum)
Date: Fri, 8 Sep 2006 11:06:41 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <200609081403.18350.fdrake@acm.org>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
Message-ID: <ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>

-1 on those particular cryptic names. Which one of seekby() and
rseek() is the relative seek? Where's the seek relative to EOF?

On 9/8/06, Fred L. Drake, Jr. <fdrake at acm.org> wrote:
> On Friday 08 September 2006 13:00, Nick Coghlan wrote:
>  > As I mentioned in a different part of the thread, I believe seek(),
>  > seekby() and rseek() would work as names for the 3 different method
>  > approach.
>
> +1, for the reasons discussed.
>
>
>   -Fred
>
> --
> Fred L. Drake, Jr.   <fdrake at acm.org>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From solipsis at pitrou.net  Fri Sep  8 20:41:13 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 08 Sep 2006 20:41:13 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
Message-ID: <1157740873.4979.10.camel@fsol>

On Friday 08 September 2006 at 11:06 -0700, Guido van Rossum wrote:
> -1 on those particular cryptic names. Which one of seekby() and
> rseek() is the relative seek? Where's the seek relative to EOF?

What about seek(), seek_relative() and seek_reverse() ?

"rseek" also looks like "relative seek" to me (having be used to move /
rmove for graphic primitives a long time ago).

Regards

Antoine.



From jimjjewett at gmail.com  Fri Sep  8 21:04:50 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 8 Sep 2006 15:04:50 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1157740873.4979.10.camel@fsol>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
	<1157740873.4979.10.camel@fsol>
Message-ID: <fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>

On 9/8/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Friday 08 September 2006 at 11:06 -0700, Guido van Rossum wrote:
> > -1 on those particular cryptic names. Which one of seekby() and
> > rseek() is the relative seek? Where's the seek relative to EOF?

> What about seek(), seek_relative() and seek_reverse() ?

Why not just borrow the standard symbolic names of cur and end?

    seek(pos=0)
    seek_cur(pos=0)
    seek_end(pos=0)

    seek_end(-1000)  <==>  1000 units (bytes, chars, records, ...) before the end
    seek_cur(50)     <==>  50 units beyond current
    seek()           <==>  beginning
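
An untested sketch of how these would map onto the existing protocol:

    import os

    class SeekableWrapper(object):
        def __init__(self, f):
            self._f = f
        def seek(self, pos=0):
            self._f.seek(pos, os.SEEK_SET)
        def seek_cur(self, pos=0):
            self._f.seek(pos, os.SEEK_CUR)
        def seek_end(self, pos=0):
            self._f.seek(pos, os.SEEK_END)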

-jJ

From qrczak at knm.org.pl  Fri Sep  8 21:21:22 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 08 Sep 2006 21:21:22 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
	(Guido van Rossum's message of "Fri, 8 Sep 2006 11:06:41 -0700")
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
Message-ID: <87bqpqqi4t.fsf@qrnik.zagroda>

"Guido van Rossum" <guido at python.org> writes:

> -1 on those particular cryptic names. Which one of seekby() and
> rseek() is the relative seek? Where's the seek relative to EOF?

I propose seek, seek_by, seek_end.

I suppose in 99% of cases seek_end is used to seek to the very end,
rather than some offset from the end, so it makes sense for the offset
to be optional.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From fdrake at acm.org  Sat Sep  9 00:06:08 2006
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 8 Sep 2006 18:06:08 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
Message-ID: <200609081806.09032.fdrake@acm.org>

On Friday 08 September 2006 14:06, Guido van Rossum wrote:
 > -1 on those particular cryptic names. Which one of seekby() and
 > rseek() is the relative seek? Where's the seek relative to EOF?

My reading was seekby() as relative, and rseek() was relative to the end.  It 
could be something like seekposition(), seekforward(), seekfromend().  Long, 
but unambiguous.


  -Fred

-- 
Fred L. Drake, Jr.   <fdrake at acm.org>

From solipsis at pitrou.net  Sat Sep  9 00:24:10 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sat, 09 Sep 2006 00:24:10 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
	<1157740873.4979.10.camel@fsol>
	<fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
Message-ID: <1157754250.8948.1.camel@fsol>

On Friday 08 September 2006 at 15:04 -0400, Jim Jewett wrote:
> > What about seek(), seek_relative() and seek_reverse() ?
> 
> Why not just borrow the standard symbolic names of cur and end?
> 
>     seek(pos=0)
>     seek_cur(pos=0)
>     seek_end(pos=0)

You are right, it's clear and shorter than my proposal.




From jackdied at jackdied.com  Sat Sep  9 01:26:04 2006
From: jackdied at jackdied.com (Jack Diederich)
Date: Fri, 8 Sep 2006 19:26:04 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <1157754250.8948.1.camel@fsol>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
	<1157740873.4979.10.camel@fsol>
	<fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
	<1157754250.8948.1.camel@fsol>
Message-ID: <20060908232604.GC6250@performancedrivers.com>

On Sat, Sep 09, 2006 at 12:24:10AM +0200, Antoine Pitrou wrote:
> On Friday 08 September 2006 at 15:04 -0400, Jim Jewett wrote:
> > > What about seek(), seek_relative() and seek_reverse() ?
> > 
> > Why not just borrow the standard symbolic names of cur and end?
> > 
> >     seek(pos=0)
> >     seek_cur(pos=0)
> >     seek_end(pos=0)

I like the C-ish style because I'm used to it.  These are OK so
long as seek(n, 2) raises an informative exception.  I was initially
going to suggest seek_abs() for the absolute seek, but if it remains
plain seek(), old users won't have to go searching the docs, and help()
would be .. helpful.

-Jack

From murman at gmail.com  Sat Sep  9 06:32:10 2006
From: murman at gmail.com (Michael Urman)
Date: Fri, 8 Sep 2006 23:32:10 -0500
Subject: [Python-3000] Help on text editors
In-Reply-To: <4500D990.3000707@blueyonder.co.uk>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<dcbbbb410609071805m7f902564xd6547b19ddec3528@mail.gmail.com>
	<4500D990.3000707@blueyonder.co.uk>
Message-ID: <dcbbbb410609082132y29216edfi6af0a77353dafcdd@mail.gmail.com>

On 9/7/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Yes. However, this is not a good idea for precisely the reason described
> on that page (false detection of Unicode), and so any Unicode detection
> algorithm in Python should only be based on detecting a BOM, IMHO.

Right, except BOMs break tons of Unix applications (and even
occasional Windows ones) which do not expect them, which leaves
Python nearly unable to detect Unicode on Unix. This is quite
unfortunate for those of us rooting for UTF-8. Perhaps there are
better heuristics that are worth considering. Perhaps not. It
certainly shouldn't be the default behaviour of a TextFile
constructor.

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

From murman at gmail.com  Sat Sep  9 06:39:25 2006
From: murman at gmail.com (Michael Urman)
Date: Fri, 8 Sep 2006 23:39:25 -0500
Subject: [Python-3000] Help on text editors
In-Reply-To: <LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>
References: <mailman.528.1157676092.5278.python-3000@python.org>
	<LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>
Message-ID: <dcbbbb410609082139v51275aeeka5cba4107ee949bc@mail.gmail.com>

On 9/7/06, Jeff Wilcox <jeff at soft.fujitsu.com> wrote:
> > From: "Paul Prescod" <paul at prescod.net>
> > 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> > "ANSI" is not a real encoding at all (and certainly not one from the
> On Japanese Windows 2000, Notepad defaults to ANSI as it does in the English
> version.  It actually writes Shift JIS though.

ANSI is not an encoding; it is a collective name for various multibyte
encodings, each corresponding to a particular default language of the
machine. Thus ANSI corresponds to cp1252 on English and cp932 on
Japanese machines.

As for whether cp932 is the same as Shift JIS, David and I seem to
disagree. While I lack hard data, the string '\\' round trips through
either on my box.
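
Concretely, a round trip of that kind looks like this with CPython's
codec names (other implementations may map things differently):

    >>> u'\\'.encode('cp932')
    '\\'
    >>> '\\'.decode('shift_jis')
    u'\\'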
-- 
Michael Urman  http://www.tortall.net/mu/blog

From ncoghlan at gmail.com  Sat Sep  9 07:44:59 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 09 Sep 2006 15:44:59 +1000
Subject: [Python-3000] iostack, second revision
In-Reply-To: <fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>	<4501A1B1.5050707@gmail.com>
	<200609081403.18350.fdrake@acm.org>	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>	<1157740873.4979.10.camel@fsol>
	<fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
Message-ID: <450254DB.3020502@gmail.com>

Jim Jewett wrote:
> On 9/8/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>> On Friday 08 September 2006 at 11:06 -0700, Guido van Rossum wrote:
>>> -1 on those particular cryptic names. Which one of seekby() and
>>> rseek() is the relative seek? Where's the seek relative to EOF?
> 
>> What about seek(), seek_relative() and seek_reverse() ?
> 
> Why not just borrow the standard symbolic names of cur and end?
> 
>     seek(pos=0)
>     seek_cur(pos=0)
>     seek_end(pos=0)
> 
>     seek_end(-1000)  <==>  1000 units (bytes, chars, records, ...) before the end
>     seek_cur(50)     <==>  50 units beyond current
>     seek()           <==>  beginning

+1 here. Short, to the point, and easy to remember for anyone already familiar 
with seek().

Cheers,
Nick.

P.S. on a slightly different topic, it would be nice if f.seek(-1) raised 
ValueError instead of IOError. Passing a negative absolute seek value is a 
program bug, not an environment problem.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From david.nospam.hopwood at blueyonder.co.uk  Sat Sep  9 16:39:17 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sat, 09 Sep 2006 15:39:17 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <dcbbbb410609082132y29216edfi6af0a77353dafcdd@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>	<dcbbbb410609071805m7f902564xd6547b19ddec3528@mail.gmail.com>	<4500D990.3000707@blueyonder.co.uk>
	<dcbbbb410609082132y29216edfi6af0a77353dafcdd@mail.gmail.com>
Message-ID: <4502D215.1080407@blueyonder.co.uk>

Michael Urman wrote:
> On 9/7/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> 
>>Yes. However, this is not a good idea for precisely the reason described
>>on that page (false detection of Unicode), and so any Unicode detection
>>algorithm in Python should only be based on detecting a BOM, IMHO.
> 
> Right, except BOMs break tons of Unix applications (and even
> occasional Windows ones) which do not expect them.

This problem is overstated. A BOM anywhere in a text causes no problem with
display, and *should* be treated as an ignorable character for searching,
etc. Note that there are plenty of other characters that should be treated
as ignorable, so the applications that are broken for BOMs are broken more
generally.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Sat Sep  9 17:04:44 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sat, 09 Sep 2006 16:04:44 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <dcbbbb410609082139v51275aeeka5cba4107ee949bc@mail.gmail.com>
References: <mailman.528.1157676092.5278.python-3000@python.org>	<LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>
	<dcbbbb410609082139v51275aeeka5cba4107ee949bc@mail.gmail.com>
Message-ID: <4502D80C.9030908@blueyonder.co.uk>

Michael Urman wrote:
> On 9/7/06, Jeff Wilcox <jeff at soft.fujitsu.com> wrote:
> 
>>>From: "Paul Prescod" <paul at prescod.net>
>>>1. On US English Windows, Notepad defaults to an encoding called "ANSI".
>>>"ANSI" is not a real encoding at all (and certainly not one from the
>>
>>On Japanese Windows 2000, Notepad defaults to ANSI as it does in the English
>>version.  It actually writes Shift JIS though.
> 
> ANSI is not an encoding; it is a collective name for various multibyte
> encodings, each corresponding to a particular default language of the
> machine. Thus ANSI corresponds to cp1252 on English and cp932 on
> Japanese machines.
> 
> As for whether cp932 is the same as Shift JIS, David and I seem to
> disagree. While I lack hard data, the string '\\' round trips through
> either on my box.

You may have an implementation that uses Cp932 or similar, but calls it
"Shift-JIS". <http://en.wikipedia.org/wiki/Shift_jis> agrees with me, FWIW.

Here is a pretty complete mapping table for Shift-JIS + common extensions
(as opposed to Cp932):

<http://wakaba-web.hp.infoseek.co.jp/table/sjis-0208-1997-std.txt>

although there is quite a bit of variation in mappings:

<http://www.haible.de/bruno/charsets/conversion-tables/Shift_JIS.html>

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Sat Sep  9 17:10:38 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sat, 09 Sep 2006 16:10:38 +0100
Subject: [Python-3000] Help on text editors
In-Reply-To: <dcbbbb410609082139v51275aeeka5cba4107ee949bc@mail.gmail.com>
References: <mailman.528.1157676092.5278.python-3000@python.org>	<LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>
	<dcbbbb410609082139v51275aeeka5cba4107ee949bc@mail.gmail.com>
Message-ID: <4502D96E.7090607@blueyonder.co.uk>

Michael Urman wrote:
> As for whether cp932 is the same as Shift JIS, David and I seem to
> disagree. While I lack hard data, the string '\\' round trips through
> either on my box.

I missed this part. On any single implementation, '\\' will usually round-trip
from Unicode -> Shift-JIS -> Unicode; the issue is whether it is encoded as
0x5C, or something else like 0x815F. It may very well not round-trip if you
use different implementations for encoding and decoding.
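
For instance, with CPython's own codecs (other implementations make
different choices):

    >>> u'\\'.encode('shift_jis')        # CPython encodes U+005C as 0x5C
    '\\'
    >>> '\x81\x5f'.decode('shift_jis')   # 0x815F is a different character
    u'\uff3c'

so a 0x815F produced by another implementation's encoder comes back as
U+FF3C (FULLWIDTH REVERSE SOLIDUS), not as '\\'.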

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From qrczak at knm.org.pl  Sat Sep  9 17:43:13 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 09 Sep 2006 17:43:13 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <4502D215.1080407@blueyonder.co.uk> (David Hopwood's message of
	"Sat, 09 Sep 2006 15:39:17 +0100")
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
	<dcbbbb410609071805m7f902564xd6547b19ddec3528@mail.gmail.com>
	<4500D990.3000707@blueyonder.co.uk>
	<dcbbbb410609082132y29216edfi6af0a77353dafcdd@mail.gmail.com>
	<4502D215.1080407@blueyonder.co.uk>
Message-ID: <87bqpp82r2.fsf@qrnik.zagroda>

David Hopwood <david.nospam.hopwood at blueyonder.co.uk> writes:

>> Right, except BOMs break tons of Unix applications (and even
>> occasional Windows ones) which do not expect them.
>
> This problem is overstated. A BOM anywhere in a text causes no
> problem with display, and *should* be treated as an ignorable
> character for searching, etc.

It is not ignorable in most file formats, and it is not automatically
ignored by reading functions of most programming languages.

> Note that there are plenty of other characters that should be
> treated as ignorable, so the applications that are broken for BOMs
> are broken more generally.

I disagree. UTF-8 BOM should not be used on Unix. It's not a reliable
method of encoding detection in general (applies only to Unicode),
and it breaks the simplicity of text streams.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Sat Sep  9 19:41:51 2006
From: paul at prescod.net (Paul Prescod)
Date: Sat, 9 Sep 2006 10:41:51 -0700
Subject: [Python-3000] Offtopic: declaring encoding
Message-ID: <1cb725390609091041h24fc67e0g975429210afcabc8@mail.gmail.com>

On 9/9/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
>
> > Note that there are plenty of other characters that should be
> > treated as ignorable, so the applications that are broken for BOMs
> > are broken more generally.
>
> I disagree. UTF-8 BOM should not be used on Unix. It's not a reliable
> method of encoding detection in general (applies only to Unicode),
> and it breaks the simplicity of text streams.


We're offtopic, but treating these decisions as operating-system-specific is
a big part of what caused the current mess, e.g. with Japanese Windows users
and Japanese Unix users using different encodings. The Unicode consortium
should address the issue of auto-encoding and make a recommendation for how
"raw" text files can have their encoding detected. A combination of BOM,
coding declaration and fall-back to UTF-8 would cover the vast majority of
the world's languages and incorporate many national encodings.

Are you defending the status quo, wherein text data cannot even be reliably
processed on the desktop on which it was created (yes, even on Unix: look
back in this thread)? Do you have a positive prescription?

 Paul Prescod

From paul at prescod.net  Sat Sep  9 19:58:22 2006
From: paul at prescod.net (Paul Prescod)
Date: Sat, 9 Sep 2006 10:58:22 -0700
Subject: [Python-3000] Help on text editors
In-Reply-To: <LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>
References: <mailman.528.1157676092.5278.python-3000@python.org>
	<LEEEILEJNIIMMBJHAKAPGEJNCCAA.jeff@soft.fujitsu.com>
Message-ID: <1cb725390609091058j49ffcdc6h61ce7eb80700f011@mail.gmail.com>

On 9/7/06, Jeff Wilcox <jeff at soft.fujitsu.com > wrote:
>
> > From: "Paul Prescod" < paul at prescod.net>
> > 1. On US English Windows, Notepad defaults to an encoding called "ANSI".
> > "ANSI" is not a real encoding at all (and certainly not one from the
> On Japanese Windows 2000, Notepad defaults to ANSI as it does in the
> English
> version.  It actually writes Shift JIS though.
>
> > 2. On my English Mac, the default character set for textedit is "Mac OS
> > Roman". What is it for foreign language macs? What API does an
> application
> > use to query this default character set? What setting is it derived
> from?
> > The Unix-level locale (seems not!) or some GUI-level setting (which
> one)?
>
> Mac OS X actually doesn't have different language versions of the
> operating
> system.  If you change the language setting, the Japanese version
> *becomes*
> the English version and vice versa. (Several of the English speakers that
> I
> work with have purchased Japanese Macs and switched them over to English,
> they're indistinguishable from English Macs afterwards.  Similarly,
> several
> Macs purchased in the US have been successfully switched to Japanese, and
> become indistinguishable from Macs bought in Japan.)


Great: but what is the default TextEdit encoding on a Japanized version of
the Mac?

 Paul Prescod

From qrczak at knm.org.pl  Sat Sep  9 22:00:34 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 09 Sep 2006 22:00:34 +0200
Subject: [Python-3000] Offtopic: declaring encoding
In-Reply-To: <1cb725390609091041h24fc67e0g975429210afcabc8@mail.gmail.com>
	(Paul Prescod's message of "Sat, 9 Sep 2006 10:41:51 -0700")
References: <1cb725390609091041h24fc67e0g975429210afcabc8@mail.gmail.com>
Message-ID: <87wt8c4xp9.fsf@qrnik.zagroda>

"Paul Prescod" <paul at prescod.net> writes:

> text data cannot even be reliably processed on the desktop on which
> it was created (yes, even on Unix: look back in this thread).

Where?

> Do you have a positive prescription?

New communication protocols and newly created file formats designed
for interchange will either specify the text encoding in metadata
(if files are expected to be edited by hand, at least in the near
future) or use UTF-8 exclusively. Simple file formats expected to be
used only locally will continue to have the encoding implicit.

The system encoding of Unix boxes will more commonly be UTF-8 as time
passes.

I'm not using UTF-8 on my desktop by default because there are still
some applications which don't work with UTF-8 terminals. The situation
is much better than it used to be 10 years ago: most applications
didn't support UTF-8 back then, now most do.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Sun Sep 10 01:26:15 2006
From: paul at prescod.net (Paul Prescod)
Date: Sat, 9 Sep 2006 16:26:15 -0700
Subject: [Python-3000] Offtopic: declaring encoding
In-Reply-To: <87wt8c4xp9.fsf@qrnik.zagroda>
References: <1cb725390609091041h24fc67e0g975429210afcabc8@mail.gmail.com>
	<87wt8c4xp9.fsf@qrnik.zagroda>
Message-ID: <1cb725390609091626r6104a8a2k7604be7b560e7f2f@mail.gmail.com>

On 9/9/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
>
> "Paul Prescod" <paul at prescod.net> writes:
>
> > text data cannot even be reliably processed on the desktop on which
> > it was created (yes, even on Unix: look back in this thread).
>
> Where?


http://mail.python.org/pipermail/python-3000/2006-September/003492.html

New communication protocols and newly created file formats designed
> for interchange will either specify the text encoding in metadata
> (if files are expected to be edited by hand and it's still a near future),
> or use UTF-8 exclusively. Simple file formats expected to be used only
> locally will continue to have the encoding implicit.
>
> The system encoding of Unix boxes will more commonly be UTF-8 as time
> passes.


Okay, thanks for your view of where things are going. I think that it is
clear that UTF-8 will replace iso8859-* on Unix over the next few years. It
isn't as clear if it (or any other global encoding) will replace EUC.

 Paul Prescod

From lists at janc.be  Sun Sep 10 01:35:32 2006
From: lists at janc.be (Jan Claeys)
Date: Sun, 10 Sep 2006 01:35:32 +0200
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
References: <1cb725390609071221x7ffc9d3epc59ea0381369d3c4@mail.gmail.com>
Message-ID: <1157844932.5109.198.camel@bedsa>

On Thu, 07-09-2006 at 12:21 -0700, Paul Prescod wrote:
> Guido has asked me to do some research in aid of a file encoding
> detection/defaulting PEP.
> 
> I only have access to a small number of operating systems and language
> variants so I need help.
> 
> If you have access to "German Windows XP", "Japanese Windows XP",
> "Spanish OS X",  "Japanese OS X", "German Ubuntu" etc., I would
> appreciate answers to the following questions. 
[...]
> 3. In general, how do modern versions of Linux and other Unix handle
> this issue? In particular: what is your default encoding and how did
> your operating system determine it? Did you install a locale-specific
> version? Did the installer ask you? Did you edit a configuration file?
> Did you change a GUI setting? What is the relationship between your
> localization of Gnome/KDE and your default encoding? 

AFAIK Ubuntu has used UTF-8 as the default encoding for all languages
since the 'hoary' release (version 5.04, which was the 2nd Ubuntu
release).


-- 
Jan Claeys


From paul at prescod.net  Sun Sep 10 05:29:05 2006
From: paul at prescod.net (Paul Prescod)
Date: Sat, 9 Sep 2006 20:29:05 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
Message-ID: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>

PEP: XXX
Title: Easy Text File Decoding
Version: $Revision$
Last-Modified: $Date$
Author: Paul Prescod <paul at prescod.net>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 09-Sep-2006
Post-History: 09-Sep-2006
Python-Version: 3.0

Abstract
========

Python 3000 will use Unicode as the standard string type. This means that
text files read from disk will be "decoded" into Unicode code points just as
binary files might be decoded into integers and structures. This change
brings a few issues to the fore that were
previously ignorable.

For example, in Python 2.x, it was possible to open a text file, read the
data into a Python string, filter some lines and print the remaining lines
to the console without ever considering what "encoding" the text was in. In
Python 3000, the programmer will only get access to
Python's powerful string manipulation functions after decoding the data to
Unicode code points. This means that either the programmer or the Python
runtime must select a decoding algorithm (by naming the encoding algorithm
that was used to encode the data in the first place).

Often the programmer can do so based upon out-of-band knowledge ("this file
format is always UCS-2" or "the protocol header says that this data is
latin-1"). In other cases, the programmer may be more naive or simply wish
to avoid thinking about it and would rather defer the issue to Python.

This document presents a proposal for algorithms and APIs that Python can
use to simplify the programmer's life.

Issues outside the scope of this PEP
=====================================

Any programmer who wishes to take direct control of the encoding selection
may of course ignore the features described in this PEP and choose a
decoding explicitly. The PEP is not intended to constrain them in any way.

Bytes received through means other than the file system are not addressed by
this PEP. For example, the PEP does not address data directly read from a
socket or returned from marshal functions.

Rationale
==========

The simplest possible use case for Python text processing involves a user
maintaining some form of simple database (e.g. an address book) as a text
file and processing it with Python. Unfortunately, this use case is not as
simple as it should be because of the variety of encodings in the universe.
For example, the file might be UTF-8, ISO-8859-1 or ISO-8859-2.

Professional programmers making widely distributed programs probably have no
alternative but to deal with this variability head-on. But programmers
working with data that originates and resides primarily on their own
computer might wish to avoid dealing with it. They would like Python to just
"try to do the right" thing with respect to the file. They would like to
think about encodings if and only if Python failed to guess appropriately.

Proposal
========

The function to open a text file will tentatively be called textfile(),
though the function name is not an integral part of this PEP. The function
takes three arguments: the filename, the mode ("r", "w", "r+", etc.) and the
type.

The type could be a true encoding or one of a small set of additional
symbolic values. The two main symbolic values are:

* "site" -- the default value, which invokes a site-specific alogrithm. For
example, a Japanese school teacher using Windows might default "site" to
Shift-JIS. An organization dealing with a small number of encodings might
default "site" to be equivalent to "guess". An organization with a strict
internationalization policy might default "site" to "UTF-8". An important
open issue is what Python's out-of-box interpretation of "site" should be.
This is key because "site" is the default value so Python's out-of-box
behaviour is the "default default".

* "guess" -- the value to be used by encoding-inexpert programmers and
experts who feel confident that Python's guessing algorithm will produce
sufficient results for their purposes. The guessing algorithm will
necessarily be complicated and may change over time. It will take into
account the following factors:

   - the conventions dominant on the operating system of choice

   - any localization-relevant settings available

   - a certain number of bytes at the start of the file (perhaps start and
end?). This sample will likely be on the order of thousands of bytes.

   - filesystem metadata attached to the file (in strong preference to the
above).

* "locale" -- the encoding suggested by the operating system's locale
concept

Other symbolic values might allow the programmer to suggest specific
encoding detection algorithms like XML [#XML-encoding-detection]_, HTML
[#HTML-encoding-detection]_ and the "coding:" comment convention. These
would be specified in separate PEPs.

The Site Decoding Hook
========================

The "sys" module could have a function called "setdefaultfileencoding". The
encoding specified could be a true encoding name or one of the encoding
detection scheme names (e.g. "guess" or "XML").

In addition, it should be possible to register new encoding detection
schemes using a method like "sys.registerencodingdetector". This function
would take two arguments, a string and a callable. The callable would accept
a byte stream argument and return a text stream. The contract for these
detection scheme implementations must allow them to peek ahead some bytes to
use the content as a hint to the encoding.
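
For illustration only, a detection scheme obeying this contract might
look like the sketch below; the registration call is the API proposed
above (it does not exist today), and the ability to peek at the byte
stream is assumed as part of the contract:

    import codecs, sys

    def bom_detector(byte_stream):
        head = byte_stream.peek(4)      # peek ahead, per the contract
        if head.startswith(codecs.BOM_UTF8):
            encoding = 'utf-8'
        elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            encoding = 'utf-16'
        else:
            encoding = 'ascii'          # strict fallback; see Open Issues
        return codecs.getreader(encoding)(byte_stream)

    sys.registerencodingdetector('bom-only', bom_detector)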

Alternatives and Open Issues
==============================

1. Guido proposes that the function be called merely "open". His proposal is
that the binary open should be the alternative and should be invoked
explicitly with a "b" mode switch. The PEP author feels, first, that changing
the behaviour of an existing function is more confusing and disruptive than
creating another. Backporting a change to the "open" function would be
difficult and therefore it would be unnecessarily difficult to create
file-manipulating libraries that work both on Python 2.x and 3.x.

Second, the author feels that "open" is an unnecessarily cryptic name
rooted only in Unix/C history. For a programmer coming from (for example)
Javascript, open() would tend to imply "open window". The PEP author
believes that factory functions should say what they are creating.

2. There is substantial disagreement on the behaviour of the function when
there is no encoding argument passed and no site override (i.e the
out-of-box default). Current proposals include ASCII (on the basis that it
is a nearly universal subset of popular encodings), UTF-8 (on the basis that
it is the dominant global standard encompassing all of Unicode), a
locale-derived encoding (on the basis that this is what a naive user will
generate in a text editor) or the guessing algorithm (on the basis that it
is by definition designed to guess right more often than any more specific
encoding name).

The PEP author strongly advocates a strict encoding like ASCII, UTF-8 or no
default at all (in which case the lack of an encoding would raise an
exception). A default like iso-8859-1 (even inferred from the environment)
will result in encodings like UTF-8, UCS-2 and even binary files being
"interpreted" as gibberish strings. This could result in document or
database corruption. An encoding with a "guess" default will encourage the
widespread creation of very unreliable code.

The current proposal is to have no out-of-box default until some point in
the future when a small set of auto-detectable encodings are globally
dominant. UTF-8 has gradually been gaining popularity through W3C and other
standards so it is possible that five years from now it will be the
"no-brainer" default. Until we can guess with substantial confidence,
absence of both an encoding declaration and a site override should result in
a thrown exception.

References
==========

.. [#XML-encoding-detection] XML Encoding Detection algorithm:
http://www.w3.org/TR/REC-xml/#sec-guessing
.. [#HTML-encoding-detection] HTML Encoding Detection algorithm:
http://www.w3.org/TR/REC-xml/#sec-guessing

Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:

From greg.ewing at canterbury.ac.nz  Sun Sep 10 08:11:05 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sun, 10 Sep 2006 18:11:05 +1200
Subject: [Python-3000] The future of exceptions
In-Reply-To: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
Message-ID: <4503AC79.4090601@canterbury.ac.nz>

Michael Chermside wrote:
> It doesn't necessarily imply that
> the traceback be materialized immediately upon exception creation
> (which is undesirable because we want exceptions lightweight enough
> to use for things like for loop control!)... but it might mean that
> pieces of the stack frame need to hang around as long as the exception
> itself does.

With the current implementation, "materialising the traceback"
and "keeping parts of the stack frame hanging around" are
pretty much the same thing, since the traceback is mostly just
a linked list of frames encountered while unwinding the stack
looking for a handler. So if there's a possibility you might
want a traceback at all at any point, it's hard to see how
the process could be made any more lightweight.
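
The linked list is visible from pure Python, for what it's worth (each
traceback object holds a frame plus a tb_next pointer):

    import sys

    try:
        1/0
    except ZeroDivisionError:
        tb = sys.exc_info()[2]
        while tb is not None:
            print(tb.tb_frame)    # each node: a frame...
            tb = tb.tb_next       # ...and a link to the next one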

However, I'm wondering whether it might be worth distinguishing
two different kinds of exceptions: "flow control" exceptions
which are used something like a non-local goto, and full-
blown exceptions. Flow control exceptions typically don't
need most of the exception machinery -- they don't carry
data of their own, so you don't need to instantiate a class
every time, and you're not usually interested in a traceback.
So maybe there should be a different form of raise statement
for these that doesn't bother making provision for them.

A problem is that if a flow control exception *doesn't* get
caught by something that's expecting it, you probably do
want a traceback in order to debug the problem.

Maybe try-statements could maintain a stack of handlers,
so the raise-control-flow-exception statement could quickly
tell whether there is a handler, and if not, raise an
ordinary exception with a traceback.

Or maybe there should be a different mechanism altogether
for non-local gotos. I'd like to see some kind of "longjmp"
object that could be invoked to cause a jump back to
a specific place. That would help alleviate the problem
that exceptions used for control flow can get caught by
the wrong handler. Sometimes you really want something
that's targeted to a specific handler, not just the next
enclosing one of some type.

--
Greg

From qrczak at knm.org.pl  Sun Sep 10 11:11:31 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sun, 10 Sep 2006 11:11:31 +0200
Subject: [Python-3000] The future of exceptions
In-Reply-To: <4503AC79.4090601@canterbury.ac.nz> (Greg Ewing's message of
	"Sun, 10 Sep 2006 18:11:05 +1200")
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
	<4503AC79.4090601@canterbury.ac.nz>
Message-ID: <874pvgozlo.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

> Flow control exceptions typically don't need most of the exception
> machinery -- they don't carry data of their own, so you don't need
> to instantiate a class every time,

It's lazily instantiated today (see PyErr_NormalizeException).

> Or maybe there should be a different mechanism altogether
> for non-local gotos. I'd like to see some kind of "longjmp"
> object that could be invoked to cause a jump back to
> a specific place.

Any non-local exit should be hookable by active function calls between
the raising point and the catching point, especially by things like
try...finally.

> Sometimes you really want something that's targeted to a specific
> handler, not just the next enclosing one of some type.

Indeed, but this can still use an exception internally. My language
Kogut has a function for that ('?' is lambda, the whole thing is an
argument of 'WithExit'):

WithExit ?exit {
   some code
   which can at some point
   call the 'exit' function introduced above,
   even from another function,
   and the control flow will return to this WithExit call
};

I think it can be exposed as something used with 'with' in Python.
'WithExit' constructs a unique exception object and catches precisely
this object.
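
A rough, untested Python sketch of that idea (all the Python names here
are invented):

    from contextlib import contextmanager

    class _Exit(Exception):
        pass

    @contextmanager
    def with_exit():
        token = _Exit()              # a unique exception object per call
        def exit(value=None):
            token.value = value
            raise token
        try:
            yield exit
        except _Exit as caught:
            if caught is not token:  # someone else's exit: keep unwinding
                raise

    with with_exit() as exit:
        print("before")
        exit()                       # control resumes after the with block
        print("never reached")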

Implementing it with an exception makes the semantics of expression
evaluation more uniform: an expression either evaluates to a value,
or fails with an exception, and there is no other possibility which
would have to be accounted for in generic wrappers which call unknown
code (e.g. my bridge between two languages, or running a computation
by another thread).

There are other kinds of non-local exits, like exiting the program
or thread cancellation, which can be implemented with exceptions and
I think it's better than inventing a separate mechanism for each.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From solipsis at pitrou.net  Sun Sep 10 12:31:24 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 12:31:24 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
Message-ID: <1157884285.4246.41.camel@fsol>

On Saturday 09 September 2006 at 20:29 -0700, Paul Prescod wrote:
> The type could be a true encoding or one of a small set of additional
> symbolic values. The two main symbolic values are:

Actually your proposal has three ;)

> For example, a Japanese school teacher using Windows might default
> "site" to Shift-JIS.

I think a Japanese school teacher using Windows shouldn't have to
configure anything specifically in Python, encoding-wise. 
I've never seen a tool (e.g. text editor) refuse to work before you had
explicitly configured an encoding *for the tool*. Those tools either
choose the system-wide default, aka "locale" (if they want to play fair
with other apps), or their own (if they think utf-8 is the future).

I see two cases where refusing to use a default is even more unhelpful:
- on the growing number of systems which have utf-8 as default
- when the programmer simply wants to open a pure-ascii text file (e.g.
configuration file), and opening it as text allows him to read it
line-by-line, or use whatever other facilities text files provide that
binary files don't


So, here is an alternative proposal :
Make it so that textfile() doesn't recognize system-wide defaults (as in
your proposal), but also provide autotextfile() which would recognize
those defaults (with a by_content=False optional argument to enable
content-based guessing).

textfile() being clearly marked for use by large well thought-out
applications, and autotextfile() for small scripts and the like.
Different names make it clear that they are for different uses, and
make them easy to spot when looking at source code (either by a human
reader or a quality measurement tool).
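
In rough code (an untested sketch, using codecs.open as a stand-in for
the real py3k text factory and leaving out the "site" hook and the
guessing machinery):

    import codecs
    import locale

    def textfile(name, mode='r', encoding=None):
        if encoding is None:         # strict: no guessing, no defaults
            raise ValueError("textfile() requires an explicit encoding")
        return codecs.open(name, mode, encoding)

    def autotextfile(name, mode='r', by_content=False):
        if by_content:
            raise NotImplementedError("content guessing not sketched here")
        return codecs.open(name, mode, locale.getpreferredencoding())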

Regards

Antoine.



From phd at mail2.phd.pp.ru  Sun Sep 10 12:35:00 2006
From: phd at mail2.phd.pp.ru (Oleg Broytmann)
Date: Sun, 10 Sep 2006 14:35:00 +0400
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
Message-ID: <20060910103500.GA13412@phd.pp.ru>

On Sat, Sep 09, 2006 at 08:29:05PM -0700, Paul Prescod wrote:
> "the protocol header says that this data is latin-1").

   "Protocol metadata" if you allow me to suggest the word.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From solipsis at pitrou.net  Sun Sep 10 13:02:57 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 13:02:57 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
Message-ID: <1157886177.4246.59.camel@fsol>


> The Site Decoding Hook
> ======================== 
> 
> The "sys" module could have a function called
> "setdefaultfileencoding". The encoding specified could be a true
> encoding name or one of the encoding detection scheme names ( e.g.
> "guess" or "XML").

Isn't it more intuitive to gather functions based on what their
high-level purpose is ("text" or "textfile") than on implementation
details of where the information comes from ("sys", "locale") ?

That function could be "textfile.set_default_encoding" (with
underscores), or even "text.textfile.set_default_encoding" (if all this
resides in a "text" module).

Regards

Antoine.



From ncoghlan at gmail.com  Sun Sep 10 13:58:00 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 10 Sep 2006 21:58:00 +1000
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157884285.4246.41.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol>
Message-ID: <4503FDC8.2030608@gmail.com>

Antoine Pitrou wrote:
> So, here is an alternative proposal :
> Make it so that textfile() doesn't recognize system-wide defaults (as in
> your proposal), but also provide autotextfile() which would recognize
> those defaults (with a by_content=False optional argument to enable
> content-based guessing).
> 
> textfile() being clearly marked for use by large, well-thought-out
> applications, and autotextfile() for small scripts and the like.
> Different names make it clear that they are for different uses, and
> make it easy to spot them when looking at source code (either by a human
> reader or a quality measurement tool).

How does your "autotextfile('myfile.txt')" differ from Paul's 
"textfile('myfile.txt', encoding='guess')"?

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Sun Sep 10 14:05:35 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 10 Sep 2006 22:05:35 +1000
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
Message-ID: <4503FF8F.6070801@gmail.com>

Paul Prescod wrote:
> The function to open a text file will tentatively be called textfile(), 
> though the function name is not an integral part of this PEP. The 
> function takes three arguments, the filename, the mode ("r", "w", "r+", 
> etc.) and the type.
> 
> The type could be a true encoding or one of a small set of additional 
> symbolic values.

The 'additional symbolic values' should be implemented as true encodings 
(i.e., it should be possible to look up 'site', 'guess' and 'locale' in the 
codecs registry, and replace them there as well).

I also agree with Guido that the right spelling for the factory function is to 
incorporate this into the existing open() builtin. The signature of open() is 
already going to change to accept an encoding argument in Py3k, and the 
special encodings proposed in the PEP are just that: special encodings that 
happen to take environmental information into account when deciding how to 
decode or encode data.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From solipsis at pitrou.net  Sun Sep 10 14:47:15 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 14:47:15 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <4503FDC8.2030608@gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol>  <4503FDC8.2030608@gmail.com>
Message-ID: <1157892435.4246.107.camel@fsol>

On Sunday 10 September 2006 at 21:58 +1000, Nick Coghlan wrote:
> Antoine Pitrou wrote:
> > So, here is an alternative proposal :
> > Make it so that textfile() doesn't recognize system-wide defaults (as in
> > your proposal), but also provide autotextfile() which would recognize
> > those defaults (with a by_content=False optional argument to enable
> > content-based guessing).
> > 
> > textfile() being clearly marked for use by large, well-thought-out
> > applications, and autotextfile() for small scripts and the like.
> > Different names make it clear that they are for different uses, and
> > make it easy to spot them when looking at source code (either by a human
> > reader or a quality measurement tool).
> 
> How does your "autotextfile('myfile.txt')" differ from Paul's 
> "textfile('myfile.txt', encoding='guess')"?

Paul's "encoding='guess'" specifies a complicated and dangerous guessing
algorithm.

However, autotextfile('myfile.txt') would mean :
- use Paul's "site" if such a thing is defined
- otherwise, use Paul's "locale"
(no content-based guessing)

On the other hand "autotextfile('myfile.txt', by_content=True)" would
enable content-based guessing, thus be equivalent to Paul's
"encoding='guess'".

To sum up the API:
 - textfile("filename.txt", mode, encoding=None): fails without  an
explicit "encoding" argument if no "site" algorithm has been explicitly
configured.
 - autotextfile("filename.txt", mode, by_content=False): selects either
the "site"-configured encoding or the locale fallback, unless
"by_content" is True in which case it tries to detect based on actual
content.
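
To make the shape of this API concrete, here is a rough sketch; the
bodies, the _site_encoding holder and the BOM-only _sniff() helper are
illustrative assumptions, not part of the proposal:

import codecs
import locale

_site_encoding = None   # set once, application-wide, if at all

def textfile(filename, mode="r", encoding=None):
    # "Clean" variant: refuses to guess.
    enc = encoding or _site_encoding
    if enc is None:
        raise ValueError("no encoding given and no site default set")
    return codecs.open(filename, mode, encoding=enc)

def autotextfile(filename, mode="r", by_content=False):
    # Quick-and-dirty variant: site default, then locale fallback.
    if by_content:
        enc = _sniff(filename)
    else:
        enc = _site_encoding or locale.getpreferredencoding()
    return codecs.open(filename, mode, encoding=enc)

def _sniff(filename):
    # Placeholder content check: BOMs only.
    with open(filename, "rb") as f:
        head = f.read(4)
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    return locale.getpreferredencoding()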


In short, my proposal is just a naming proposal to achieve the following
goals :
- the textfile() function is "clean", and satisfies the ideal that it
is Wrong to not specify an encoding when retrieving text from on-disk
bytes
- the autotextfile() function makes it easy to write simple scripts,
with an easy-to-remember function with an explicit name (instead of a
magic value in an optional string argument)
- the autotextfile() function makes it easy to spot those abusive uses
of the quick-and-dirty way in apps which strive for interoperability and
portability

(in French we say "ne pas mélanger les torchons et les serviettes" :
don't mix towels and rags :-))

All this can be in a module, no need to pollute the top-level
namespace :
from text import textfile
from text import autotextfile


> The 'additional symbolic values' should be implemented as true
> encodings (i.e., it should be possible to look up 'site', 'guess' and
> 'locale' in the codecs registry, and replace them there as well).

Treating different things as "true encodings" does not help
understandability IMHO. "guess", "site" and "locale" are not encodings
in themselves, they are decision algorithms. In particular, "guess" has
to look at big chunks of existing text contents before deciding (which
may or may not have side-effects such as unexpected buffering).

Really, while "iso-8859-1" or "utf-8" is always the same encoding,
"guess" will not always result in the same encoding being used: it
depends on actual data fed to it. "guess" will not even allow the same
set of characters to be used: if "guess" results in "iso-8859-1", then I
can't use all the (Unicode) characters that I can use when "guess"
results in "utf-8".
This variability/unpredictability is a fundamental difference in
behaviour compared to a "true encoding", for which you can always be
sure what set of (textual) data can be represented.

Regards

Antoine.



From solipsis at pitrou.net  Sun Sep 10 15:21:14 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 15:21:14 +0200
Subject: [Python-3000] encoding='guess' ?
In-Reply-To: <1157892435.4246.107.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol>  <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>
Message-ID: <1157894475.4246.130.camel@fsol>


Hi,

Let me add that 'guess' should probably be forbidden as an encoding
parameter (instead, a separate function argument should be used as in my
proposal).

Here is a schematic example to show why :

def append_text(filename, encoding):
    src = textfile(filename, "r", encoding)
    my_text = src.read()
    src.close()
    dst = textfile("textlist.txt", "r+", encoding)
    dst.seek_end(0)
    dst.write(my_text + "\n")
    dst.close()

With Paul's current proposal three cases can arise :
 - "encoding" is a real encoding name like iso-8859-1 or utf-8. There
should be no problems, since we assume this encoding has been configured
once and for all in the application.
 - "encoding" is either "site" or "locale". This should result in the
same value run after run, since we assume the site or locale encoding
value has been configured once and for all.
 - "encoding" is "guess". In this case anything can happen. A possible
occurrence is that for the first file, it will result in utf-8 being
detected (or Shift-JIS, or whatever), and for the second file it will be
iso-8859-1. This will lead to a crash in the likely case that some
characters in the source file can't be represented using the character
encoding auto-detected for the destination file.

Yet the append_text() function does look correct, doesn't it?

We shouldn't hide a contextual encoding-detection algorithm under an
encoding name. It leads to semantic uncertainty.

Regards

Antoine.



From ncoghlan at gmail.com  Sun Sep 10 15:44:16 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 10 Sep 2006 23:44:16 +1000
Subject: [Python-3000] encoding='guess' ?
In-Reply-To: <1157894475.4246.130.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>	<1157884285.4246.41.camel@fsol>
	<4503FDC8.2030608@gmail.com>	<1157892435.4246.107.camel@fsol>
	<1157894475.4246.130.camel@fsol>
Message-ID: <450416B0.4050109@gmail.com>

Antoine Pitrou wrote:
> Hi,
> 
> Let me add that 'guess' should probably be forbidden as an encoding
> parameter (instead, a separate function argument should be used as in my
> proposal).
> 
> Here is a schematic example to show why :
> 
> def append_text(filename, encoding):
>     src = textfile(filename, "r", encoding)
>     my_text = src.read()
>     src.close()
>     dst = textfile("textlist.txt", "r+", encoding)
>     dst.seek_end(0)
>     dst.write(my_text + "\n")
>     dst.close()
> 
> With Paul's current proposal three cases can arise :
>  - "encoding" is a real encoding name like iso-8859-1 or utf-8. There
> should be no problems, since we assume this encoding has been configured
> once and for all in the application.
>  - "encoding" is either "site" or "locale". This should result in the
> same value run after run, since we assume the site or locale encoding
> value has been configured once and for all.
>  - "encoding" is "guess". In this case anything can happen. A possible
> occurrence is that for the first file, it will result in utf-8 being
> detected (or Shift-JIS, or whatever), and for the second file it will be
> iso-8859-1. This will lead to a crash in the likely case that some
> characters in the source file can't be represented using the character
> encoding auto-detected for the destination file.
> 
> Yet the append_text() function does look correct, doesn't it?
> 
> We shouldn't hide a contextual encoding-detection algorithm under an
> encoding name. It leads to semantic uncertainty.

Interesting. This goes back more towards the model of "no default encoding, 
but provide the right tools to make it easy for a program to choose one in the 
absence of any metadata".

So perhaps there should just be an explicit function "guessencoding()" that 
accepts a filename and returns a codec name. So if you want to guess, you 
would do something like:

f = open(fname, 'r', string.guessencoding(fname))

The PEP's other suggestions would then be spelled something like:

f = open(fname, 'r', string.getlocaleencoding())
f = open(fname, 'r', string.getsiteencoding())

Cheers,
Nick.



-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From david.nospam.hopwood at blueyonder.co.uk  Sun Sep 10 15:52:44 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sun, 10 Sep 2006 14:52:44 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157892435.4246.107.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>	<1157884285.4246.41.camel@fsol>
	<4503FDC8.2030608@gmail.com> <1157892435.4246.107.camel@fsol>
Message-ID: <450418AC.2010400@blueyonder.co.uk>

Antoine Pitrou wrote:
On Sunday 10 September 2006 at 21:58 +1000, Nick Coghlan wrote:
>>Antoine Pitrou wrote:
>>
>>>So, here is an alternative proposal :
>>>Make it so that textfile() doesn't recognize system-wide defaults (as in
>>>your proposal), but also provide autotextfile() which would recognize
>>>those defaults (with a by_content=False optional argument to enable
>>>content-based guessing).
>>>
>>>textfile() being clearly marked for use by large, well-thought-out
>>>applications, and autotextfile() for small scripts and the like.
>>>Different names make it clear that they are for different uses, and
>>>make it easy to spot them when looking at source code (either by a human
>>>reader or a quality measurement tool).
>>
>>How does your "autotextfile('myfile.txt')" differ from Paul's 
>>"textfile('myfile.txt', encoding='guess')"?
> 
> Paul's "encoding='guess'" specifies a complicated and dangerous guessing
> algorithm.

Indeed, to the extent that it specifies anything. However, guessing algorithms
can differ greatly in how complicated and dangerous they are.

Here is a very simple, reasonably (although not completely) safe, and much
more predictable guessing algorithm, based on a generalization of
<http://www.w3.org/TR/REC-xml/#sec-guessing>:

   Let A, B, C, and D be the first 4 bytes of the stream, or None if the
     corresponding byte is past end-of-stream.

   Let other be any encoding which is to be used as a default if no specific
     UTF is detected.

   if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
   if B == None: return other
   if A == 0 and B == 0 and D != None: return UTF32BE
   if C == 0 and D == 0: return UTF32LE
   if A == 0xFE and B == 0xFF: return UTF16BE
   if A == 0xFF and B == 0xFE: return UTF16LE
   if A != 0 and B != 0: return other
   if A == 0: return UTF16BE
   return UTF16LE

This would normally be used with 'other' as the system encoding, as an alternative
to just assuming that the file is in the system encoding.

There is very little chance of this algorithm misdetecting a file in a non-Unicode
encoding as Unicode. For that to happen, either the first two or three bytes would
have to be encoded in exactly the same way as a UTF-16 or UTF-8 BOM, or one of the
first three characters would have to be NUL.

However, if the file *is* Unicode and it starts with a BOM, then its UTF will
always be correctly detected.

Furthermore, UTF-16 and UTF-32 will be correctly detected if the file starts with
a character from U+0001 to U+00FF (i.e. non-NUL and in the ISO-8859-1 range).

Another advantage of this algorithm is that it always reads only 4 bytes.
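
For reference, a direct Python transcription of the above (in
modern-Python terms, where indexing bytes yields integers); the codec
names returned are an assumption, and the all-NUL corner case discussed
later in the thread is reproduced faithfully:

def detect_utf(head, other):
    # head: the first (up to) 4 bytes of the stream.
    A, B, C, D = (list(head[:4]) + [None] * 4)[:4]
    if (A, B, C) == (0xEF, 0xBB, 0xBF):
        return "utf-8-sig"          # UTF-8; -sig strips the BOM
    if B is None:
        return other
    if A == 0 and B == 0 and D is not None:
        return "utf-32-be"
    if C == 0 and D == 0:
        return "utf-32-le"
    if (A, B) == (0xFE, 0xFF):
        return "utf-16-be"
    if (A, B) == (0xFF, 0xFE):
        return "utf-16-le"
    if A != 0 and B != 0:
        return other
    if A == 0:
        return "utf-16-be"
    return "utf-16-le"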

> However, autotextfile('myfile.txt') would mean :
> - use Paul's "site" if such a thing is defined
> - otherwise, use Paul's "locale"
> (no content-based guessing)
> 
> On the other hand "autotextfile('myfile.txt', by_content=True)" would
> enable content-based guessing, thus be equivalent to Paul's
> "encoding='guess'".

As I pointed out earlier, any file open function that guesses the encoding
should return which encoding has been guessed. Alternatively, it could be
possible to allow the encoding to be set after the file has been opened,
in which case a separate function could do the guessing.

>>The 'additional symbolic values' should be implemented as true
>>encodings (i.e., it should be possible to look up 'site', 'guess' and
>>'locale' in the codecs registry, and replace them there as well).
> 
> Treating different things as "true encodings" does not help
> understandability IMHO. "guess", "site" and "locale" are not encodings
> in themselves, they are decision algorithms.

+1.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>




From solipsis at pitrou.net  Sun Sep 10 16:00:06 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 16:00:06 +0200
Subject: [Python-3000] encoding='guess' ?
In-Reply-To: <450416B0.4050109@gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol>  <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <1157894475.4246.130.camel@fsol>
	<450416B0.4050109@gmail.com>
Message-ID: <1157896806.4246.138.camel@fsol>


On Sunday 10 September 2006 at 23:44 +1000, Nick Coghlan wrote:
> Interesting. This goes back more towards the model of "no default encoding, 
> but provide the right tools to make it easy for a program to choose one in the 
> absence of any metadata".

In the "clean" API yes.
But it would be nice to also have an easy API for small scripts, hence
my "autotextfile" proposal.
(and, it would also avoid making life too hard for beginners trying to
learn the language)

> f = open(fname, 'r', string.guessencoding(fname))

This one is inefficient because it results in opening the file twice:
once in string.guessencoding(), and once in open().
This does not happen if there is a special argument instead, like
"by_content=True" in my proposal.

Cheers 

Antoine.



From solipsis at pitrou.net  Sun Sep 10 16:04:47 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 16:04:47 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <450418AC.2010400@blueyonder.co.uk>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>  <450418AC.2010400@blueyonder.co.uk>
Message-ID: <1157897087.4246.143.camel@fsol>

On Sunday 10 September 2006 at 14:52 +0100, David Hopwood wrote:
> > On the other hand "autotextfile('myfile.txt', by_content=True)" would
> > enable content-based guessing, thus be equivalent to Paul's
> > "encoding='guess'".
> 
> As I pointed out earlier, any file open function that guesses the encoding
> should return which encoding has been guessed.

Since open files are objects, the encoding can just be a read-only
property:

# replace autotextfile by whatever API is finally chosen ;)
f = autotextfile('myfile.txt', by_content=True)
enc = f.encoding

> Alternatively, it could be possible to allow the encoding to be set
> after the file has been opened, in which case a separate function
> could do the guessing.

Yes, sounds like a nice alternative.

Regards

Antoine.



From solipsis at pitrou.net  Sun Sep 10 16:27:12 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 16:27:12 +0200
Subject: [Python-3000] sys.stdin and sys.stdout with textfile
Message-ID: <1157898432.4246.161.camel@fsol>


Hi,

Another aspect of the textfile discussion.
sys.stdin and sys.stdout are for now, concretely, byte streams (AFAIK,
at least under Unix). Yet it must be possible to read/write text to and
from them.

So two questions:
 - Is there a builtin text.stdin / text.stdout counterpart to
sys.stdin / sys.stdout (the former being text versions, the latter raw
bytes versions) ?
Or a way to write: my_input_file = textfile(sys.stdin) ?
 - How is the default encoding handled?
Does Python mandate setting an encoding before calling print() or
raw_input() ?

Also, consider a "script.py" beginning with:

import sys, text
if len(sys.argv) > 1:
    f = text.textfile(sys.argv[1], "r")
else:
    f = text.stdin

Should encoding policy be chosen differently depending on whether the
script is called with:
    python script.py in.txt
or with:
    python script.py < in.txt
?

Regards

Antoine.



From qrczak at knm.org.pl  Sun Sep 10 18:08:14 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sun, 10 Sep 2006 18:08:14 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	(Paul Prescod's message of "Sat, 9 Sep 2006 20:29:05 -0700")
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
Message-ID: <87d5a366xd.fsf@qrnik.zagroda>

"Paul Prescod" <paul at prescod.net> writes:

> The type could be a true encoding or one of a small set of additional
> symbolic values. The two main symbolic values are:

Here is a counter-proposal.

There is a variable sys.default_encoding. It's used by file opening
functions when the encoding is not specified explicitly, among others.
Its initial value is set in site.py with a site-specific algorithm.

Two variants of the proposal:

1. The default site-specific algorithm queries the locale on Unix,
   uses "mbcs" on Windows (which is a special encoding which causes
   to use MultiByteToWideChar as the decoding function), and something
   appropriate on other systems.

2. The default initial value is "locale" (or "system" or "default" or
   whatever, but the spelling is fixed), which is a special encoding
   name which means to use the system-specific encoding, as above.

I prefer variant 1: it's simpler and it allows programs to examine the
choice on Unix.

A Python-specific environment variable could be defined to override
the system-specific choice.
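
A minimal sketch of variant 1 as it might run from site.py; the
PYTHONENCODING override variable is purely hypothetical:

import os
import sys
import locale

def _default_encoding():
    # Hypothetical Python-specific override, as suggested above.
    override = os.environ.get("PYTHONENCODING")
    if override:
        return override
    if sys.platform == "win32":
        # Special encoding decoded via MultiByteToWideChar.
        return "mbcs"
    # Unix: query the locale, e.g. en_US.UTF-8 -> "UTF-8".
    return locale.getpreferredencoding() or "ascii"

sys.default_encoding = _default_encoding()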

If MultiByteToWideChar on Windows doesn't handle UTF-8 even with a BOM
(I don't know whether it does), then the Windows default could be an
encoding which assumes UTF-8 when a UTF-8 BOM is present, and uses
MultiByteToWideChar otherwise. This applies only to Windows; Unix
rarely uses a BOM, OTOH on Unix you can have UTF-8 locales which
Windows doesn't have as far as I know.

Other than that, guessing the encoding from the contents of the text
stream, especially statistical guessing based on well-formed UTF-8
non-ASCII characters, shouldn't be encouraged, because its effect is
not predictable. There can be a separate function which guesses the
encoding for those who really want to do this.

If Python ever has dynamically-scoped variables, sys.default_encoding
should be dynamically scoped, so it's possible to set it for the context
of a block of code.

sys.default_encoding also applies to filenames, to names and values of
environment variables, to program invocation parameters (both sys.argv
and os.exec*), to pwd.struct_passwd.pw_gecos, etc. There are a number
of Unix interfaces which don't specify the encoding of the text they
exchange (and of course pw_gecos doesn't contain a BOM if it's UTF-8).


Antoine Pitrou <solipsis at pitrou.net> writes:

> sys.stdin and sys.stdout are for now, concretely, byte streams (AFAIK,
> at least under Unix). Yet it must be possible to read/write text to and
> from them.

Here is how my language Kogut does this:

RawStdIn etc. are the underlying raw files (thin wrappers over file
descriptors). StdIn etc. are text files with encoding, buffering etc.
They are initialized the first time they are used, i.e. the first time
the StdIn variable is read. They are constructed with the default
encoding in effect at that time.

This allows a script to set the default encoding before accessing
standard text streams.

I don't know whether Python typically accesses stdin/stdout during
initialization, before the first line of the script is executed.
If it does, this design can't be used until this is changed.
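
In Python terms, the lazy construction could look roughly like this;
sys.default_encoding and the wrapper class are assumptions of the
sketch, not an existing API:

import io
import sys

class LazyTextStream:
    # Wraps a raw byte stream; the encoding is chosen on first use,
    # so a script may change the default before touching the stream.
    def __init__(self, raw):
        self._raw = raw
        self._text = None

    def __getattr__(self, name):
        if self._text is None:
            enc = getattr(sys, "default_encoding", "utf-8")
            self._text = io.TextIOWrapper(self._raw, encoding=enc)
        return getattr(self._text, name)

stdin = LazyTextStream(sys.stdin.buffer)   # RawStdIn -> StdIn, lazily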

> Also, consider a "script.py" beginning with:
>
> import sys, text
> if len(sys.argv) > 1:
>     f = text.textfile(sys.argv[1], "r")
> else:
>     f = text.stdin
>
> Should encoding policy be chosen differently depending on whether the
> script is called with:
>     python script.py in.txt
> or with:
>     python script.py < in.txt
> ?

With my design it's the same. It's also the same if the script does
sys.default_encoding = 'ISO-8859-1' at the beginning.

Note: in my design sys.argv is also initialized lazily (in fact each
time it is accessed, until it's assigned to, at which point it starts
to behave as a normal variable).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From guido at python.org  Sun Sep 10 19:04:56 2006
From: guido at python.org (Guido van Rossum)
Date: Sun, 10 Sep 2006 10:04:56 -0700
Subject: [Python-3000] sys.stdin and sys.stdout with textfile
In-Reply-To: <1157898432.4246.161.camel@fsol>
References: <1157898432.4246.161.camel@fsol>
Message-ID: <ca471dc20609101004t2d55b686x4908c39981467106@mail.gmail.com>

On 9/10/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
> Another aspect of the textfile discussion.
> sys.stdin and sys.stdout are for now, concretely, byte streams (AFAIK,
> at least under Unix).

No, they are conceptually text streams, because that's what they are
on Windows, which is the only remaining platform where you can currently
experience the difference between text and byte streams.

> Yet it must be possible to read/write text to and
> from them.

I'd turn it around. If you want to read bytes from stdin (sometimes a
useful thing for filters), in Py3k you better dig out the underlying
byte stream and use that.

> So two questions:
>  - Is there a builtin text.stdin / text.stdout counterpart to
> sys.stdin / sys.stdout (the former being text versions, the latter raw
> bytes versions) ?

You've got it backwards.

> Or a way to write: my_input_file = textfile(sys.stdin) ?
>  - How is the default encoding handled?
> Does Python mandate setting an encoding before calling print() or
> raw_input() ?

Not in my view of the future. :-)

> Also, consider a "script.py" beginning with:
>
> import sys, text
> if len(sys.argv) > 1:
>     f = text.textfile(sys.argv[1], "r")
> else:
>     f = text.stdin
>
> Should encoding policy be chosen differently depending on whether the
> script is called with:
>     python script.py in.txt
> or with:
>     python script.py < in.txt
> ?

All sorts of things are different when reading stdin vs. opening a
filename. e.g. stdin may be a pipe.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Sun Sep 10 19:09:17 2006
From: guido at python.org (Guido van Rossum)
Date: Sun, 10 Sep 2006 10:09:17 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <4503FF8F.6070801@gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<4503FF8F.6070801@gmail.com>
Message-ID: <ca471dc20609101009x3ebe0482x5ceb786e748558b@mail.gmail.com>

On 9/10/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> The 'additional symbolic values' should be implemented as true encodings
> (i.e., it should be possible to look up 'site', 'guess' and 'locale' in the
> codecs registry, and replace them there as well).

That's hard to do since guessing, at least, may require inspection of
a large portion of the input data before settling upon a specific
choice. The decoding API doesn't have a way to do this AFAIK. And for
encoding (output) it's even more iffy -- if possible I'd like the
guessing function to have access to what was in the file before it was
emptied by the "create" function, or what's at the start before
appending to the end.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido at python.org  Sun Sep 10 19:11:46 2006
From: guido at python.org (Guido van Rossum)
Date: Sun, 10 Sep 2006 10:11:46 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <87d5a366xd.fsf@qrnik.zagroda>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<87d5a366xd.fsf@qrnik.zagroda>
Message-ID: <ca471dc20609101011n3c8d57e4w1931462097d246b5@mail.gmail.com>

On 9/10/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> Here is a counter-proposal.
>
> There is a variable sys.default_encoding. It's used by file opening
> functions when the encoding is not specified explicitly, among others.
> Its initial value is set in site.py with a site-specific algorithm.

This doesn't seem to allow guessing based on the file's contents. That
seems intentional on your part, but I believe it makes for way too
many disappointing user experiences.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From 2006 at jmunch.dk  Sun Sep 10 19:53:09 2006
From: 2006 at jmunch.dk (Anders J. Munch)
Date: Sun, 10 Sep 2006 19:53:09 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <450254DB.3020502@gmail.com>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>	<4501A1B1.5050707@gmail.com>	<200609081403.18350.fdrake@acm.org>	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>	<1157740873.4979.10.camel@fsol>	<fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
	<450254DB.3020502@gmail.com>
Message-ID: <45045105.5040209@jmunch.dk>

Nick Coghlan wrote:
 > Jim Jewett wrote:
 >> Why not just borrow the standard symbolic names of cur and end?
 >>
 >>     seek(pos=0)
 >>     seek_cur(pos=0)
 >>     seek_end(pos=0)

I say drop seek_cur and seek_end altogether, and keep only absolute
seek.

The C library caters for archaic file systems that are record-based
or otherwise not well modelled as an array of bytes.  That's where the
ftell/fseek/fpos_t system comes from: An fpos_t might be a composite
data type containing a record number and a within-record offset; but as
long as it's used as an opaque token, you'd never notice.

That was a nice design for backward-compatibility back in the early
1970's.  Thirty years later do we still need it?  POSIX and Win32 have
array-of-bytes files.  Does CPython even run on any OS where binary
files are not seen as arrays of bytes?  I'm saying _binary_ files
because a gander through the standard library shows that seeking is
never done on text files.  Even mailbox.py opens Unix mailbox files as
binary.

The majority of f.seek(.., 2) calls in the library use it for computing
the length of the file.  How's that for an "opaque token": f.tell() is
taken to be the length of the file after f.seek(0,2).

As for seeking to the end with only an absolute .seek available:
Surely, any file that supports seeking to the end will also support
reporting the file size.  Thus
  f.seek(f.length)
should suffice, and what could be clearer?  Also, there's the "a+"
mode for appending, no seeks required.
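
For reference, the idiom in question, with the proposed attribute shown
as a hypothetical alternative:

with open("data.bin", "rb") as f:
    f.seek(0, 2)        # 2 = SEEK_END: position at the end of file
    size = f.tell()     # the offset there is the file's length
    # proposed instead: size = f.length (hypothetical attribute),
    # and f.seek(f.length) to reach the end with absolute seek only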

Having just a single method/mode will not only ease file-protocol
implementation, but IMO client code will be easier to read as well.

- Anders

PS: I'm working on that FileBytes object, Tomer, as a wrapper over an
object that supports seek to absolute position, with integrated
buffering.


From jcarlson at uci.edu  Sun Sep 10 20:08:41 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 10 Sep 2006 11:08:41 -0700
Subject: [Python-3000] The future of exceptions
In-Reply-To: <4503AC79.4090601@canterbury.ac.nz>
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
	<4503AC79.4090601@canterbury.ac.nz>
Message-ID: <20060910110221.F8EA.JCARLSON@uci.edu>


Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Or maybe there should be a different mechanism altogether
> for non-local gotos. I'd like to see some kind of "longjmp"
> object that could be invoked to cause a jump back to
> a specific place. That would help alleviate the problem
> that exceptions used for control flow can get caught by
> the wrong handler. Sometimes you really want something
> that's targeted to a specific handler, not just the next
> enclosing one of some type.

I imagine you mean something like this...

    try:
        for ....:
            try:
                dosomething()
            except Exception:
                ...
    except FlowException1:
        ...

And the answer I've always heard is:

    try:
        for ....:
            try:
                dosomething()
            except FlowException1:
                raise
            except Exception:
                ...
    except FlowException1:
        ...


That really only works if you have control over the entire stack of
possible exception handlers, but it is also the only way it
makes sense, unless I'm misunderstanding what you are asking for.  If I
am misunderstanding, please provide some sample code showing what needs
to be done now, and what you would like to be possible.

 - Josiah


From paul at prescod.net  Sun Sep 10 20:09:39 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 11:09:39 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <20060910103500.GA13412@phd.pp.ru>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<20060910103500.GA13412@phd.pp.ru>
Message-ID: <1cb725390609101109q52578a84o5382fd4e97e40ba0@mail.gmail.com>

Suggestion accepted.

On 9/10/06, Oleg Broytmann <phd at mail2.phd.pp.ru> wrote:
>
> On Sat, Sep 09, 2006 at 08:29:05PM -0700, Paul Prescod wrote:
> > "the protocol header says that this data is latin-1").
>
>    "Protocol metadata" if you allow me to suggest the word.
>
> Oleg.
> --
>      Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
>            Programmers don't die, they just GOSUB without RETURN.
>

From paul at prescod.net  Sun Sep 10 20:14:07 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 11:14:07 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157886177.4246.59.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157886177.4246.59.camel@fsol>
Message-ID: <1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>

I went based on the current setdefaultencoding. But it seems that we will
accumulate 3 or 4 related functions so I'm persuaded that there should be a
module.

encodingdetection.setdefaultfileencoding
encodingdetection.registerencodingdetector
encodingdetection.guessfileencoding(filename)
encodingdetection.guessfileencoding(bytestream)

Suggestion accepted.
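
A sketch of how that module's surface could hang together; the bodies
and the detector registry are illustrative assumptions only:

_default_encoding = None
_detectors = {}

def setdefaultfileencoding(name):
    # 'name' may be a real encoding or a detection scheme name.
    global _default_encoding
    _default_encoding = name

def registerencodingdetector(name, detector):
    # detector: callable taking leading bytes, returning an encoding.
    _detectors[name] = detector

def guessfileencoding(source):
    # Accepts a filename or a byte stream, per the two forms above.
    if isinstance(source, str):
        with open(source, "rb") as f:
            head = f.read(1024)
    else:
        head = source.read(1024)
    return _detectors["guess"](head)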

On 9/10/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
>
> > The Site Decoding Hook
> > ========================
> >
> > The "sys" module could have a function called
> > "setdefaultfileencoding". The encoding specified could be a true
> > encoding name or one of the encoding detection scheme names ( e.g.
> > "guess" or "XML").
>
> Isn't it more intuitive to gather functions based on what their
> high-level purpose is ("text" or "textfile") rather than on implementation
> details of where the information comes from ("sys", "locale")?
>
> That function could be "textfile.set_default_encoding" (with
> underscores), or even "text.textfile.set_default_encoding" (if all this
> resides in a "text" module).
>
> Regards
>
> Antoine.
>
>

From jcarlson at uci.edu  Sun Sep 10 20:25:43 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 10 Sep 2006 11:25:43 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <450418AC.2010400@blueyonder.co.uk>
References: <1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
Message-ID: <20060910111814.F8ED.JCARLSON@uci.edu>


David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Here is a very simple, reasonably (although not completely) safe, and much
> more predictable guessing algorithm, based on a generalization of
> <http://www.w3.org/TR/REC-xml/#sec-guessing>:
> 
>    Let A, B, C, and D be the first 4 bytes of the stream, or None if the
>      corresponding byte is past end-of-stream.
> 
>    Let other be any encoding which is to be used as a default if no specific
>      UTF is detected.
> 
>    if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
>    if B == None: return other
>    if A == 0 and B == 0 and D != None: return UTF32BE
>    if C == 0 and D == 0: return UTF32LE
>    if A == 0xFE and B == 0xFF: return UTF16BE
>    if A == 0xFF and B == 0xFE: return UTF16LE
>    if A != 0 and B != 0: return other
>    if A == 0: return UTF16BE
>    return UTF16LE
> 
> This would normally be used with 'other' as the system encoding, as an alternative
> to just assuming that the file is in the system encoding.

Using the XML guessing mechanism is fine, as long as you get it right. 
A first pass with BOM detection and a second pass to "guess" based on
content in the case that a BOM isn't detected seems to make sense.

Note that the above algorithm returns UTF32BE for files beginning with
4 null bytes.

 - Josiah


From paul at prescod.net  Sun Sep 10 20:25:07 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 11:25:07 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <4503FF8F.6070801@gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<4503FF8F.6070801@gmail.com>
Message-ID: <1cb725390609101125q26fba051ya086e5ed005e08c5@mail.gmail.com>

On 9/10/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
>
> Paul Prescod wrote:
> > The function to open a text file will tentatively be called textfile(),
> > though the function name is not an integral part of this PEP. The
> > function takes three arguments, the filename, the mode ("r", "w", "r+",
> > etc.) and the type.
> >
> > The type could be a true encoding or one of a small set of additional
> > symbolic values.
>
> The 'additional symbolic values' should be implemented as true encodings
> (i.e., it should be possible to look up 'site', 'guess' and 'locale' in the
> codecs registry, and replace them there as well).


I don't believe that these are "true" encodings because when you query a
stream for its encoding you will never find these names nor an alias for
them.

> I also agree with Guido that the right spelling for the factory function
> is to incorporate this into the existing open() builtin. The signature of
> open() is already going to change to accept an encoding argument in Py3k,
> and the special encodings proposed in the PEP are just that: special
> encodings that happen to take environmental information into account when
> deciding how to decode or encode data.


Yes, well I disagree that the open function should get a new argument. I
think it should either be deprecated or used to open byte streams. The
function name is a holdover from Unix/C which has no resonance with a Java,
C#, or JavaScript programmer.

Plus I would like to ease the writing of code that is both valid Python 2.x and
3.x. I'd advocate the strategy that we should try to have a large enough
behavioural overlap that modules can be written to run on both. Subtle
changes in semantics make this difficult. To the extent that this is
unavoidable (e.g. behaviour of very core syntax) I guess we'll have to live
with it. But we can easily add a function called textfile() to both Python
2.x and Python 3.x and ease the transition.

 Paul Prescod

From paul at prescod.net  Sun Sep 10 20:30:14 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 11:30:14 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157892435.4246.107.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>
Message-ID: <1cb725390609101130w73f7a12bs243d8d6548b7b2d8@mail.gmail.com>

I don't mind your name of autotextfile but I think that your by_content
argument defeats the goal of having a very simple API for quick and dirty
stuff. If content detection is a good idea (usually right) then we should do
it. If it isn't, we shouldn't. I don't see a need for an argument to turn it
on and off. The programmer is not likely to have a lot more understanding
than we do of whether it is effective or not.

Also, there are two different levels of content detection (as someone later
in the thread pointed out). There is looking at BOMs, and there is a
statistical approach of looking for high characters and inferring their
encoding. I can't see an argument for ever turning off the BOM detection.

 Paul Prescod

From paul at prescod.net  Sun Sep 10 21:02:44 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 12:02:44 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <450418AC.2010400@blueyonder.co.uk>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
Message-ID: <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>

On 9/10/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>
> Here is a very simple, reasonably (although not completely) safe, and much
> more predictable guessing algorithm, based on a generalization of
> <http://www.w3.org/TR/REC-xml/#sec-guessing>:


Your algorithm is more predictable but will confuse BOM-less UTF-8 with the
system encoding frequently. I haven't decided in my own mind whether that
trade-off is worth making. It will work well for:

 * Windows users, who will often find a BOM in their UTF-8

 * Western Unix/Linux users who will increasingly use UTF-8 as their system
encoding

It will not work well for:

 * Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving as"
UTF-8

 * Mac users using UTF-8 apps or saving as UTF-8.

I still haven't decided how I feel about that trade-off.

Maybe the guessing algorithm should read the WHOLE FILE. After all, we've
said repeatedly that it isn't for production use so making it a bit
inefficient is not a big problem and might even emphasize that point.

Modern I/O is astonishingly fast anyhow. On my computer it takes five
seconds to decode a quarter gigabyte of UTF-8 text through Python. That
would be a totally unacceptable waste for a production program, but for a
quick hack it wouldn't be bad. And it would guarantee that you would never
get an exception half-way through your parsing because of a bad character.

 Paul Prescod

From solipsis at pitrou.net  Sun Sep 10 21:36:48 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 21:36:48 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
Message-ID: <1157917008.4257.8.camel@fsol>


On Sunday 10 September 2006 at 12:02 -0700, Paul Prescod wrote:
> Your algorithm is more predictable but will confuse BOM-less UTF-8
> with the system encoding frequently.

I don't think it is desirable to acknowledge only some kinds of UTF-8.
It will confuse the hell out of programmers, and users.

I'm not sure full-blown statistical analysis is necessary anyway. There
should be an ordered list of detectable encodings, which realistically
would be [all Unicode variants, system default]. Then if you have a file
which is syntactically valid UTF-8, it most likely /is/ UTF-8 and not
ISO-8859-1 (for example).
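
That ordered-list idea is cheap to express; a rough sketch, with the
candidate order and the final fallback as assumptions:

import locale

def detect(data, candidates=("utf-8",)):
    # Try strict decodes in priority order; first success wins.
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return locale.getpreferredencoding()   # the "system default" entry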

> Modern I/O is astonishingly fast anyhow. On my computer it takes five
> seconds to decode a quarter gigabyte of UTF-8 text through Python.

Maybe we shouldn't be that presumptuous. Modern I/O is fast but memory
is not infinite. That quarter gigabyte will have swapped out other
data/code in order to make room in the filesystem cache.
Also, Python is often used on more modest hardware.

Regards

Antoine.



From tjd at sfu.ca  Sun Sep 10 21:46:46 2006
From: tjd at sfu.ca (Toby Donaldson)
Date: Sun, 10 Sep 2006 12:46:46 -0700
Subject: [Python-3000] educational aspects of Python 3000
Message-ID: <a2f565170609101246s5d2e6fd1x7fac42826c0583e5@mail.gmail.com>

Hello,

There's been an explosion of discussion on the EDU-SIG list recently
about the removal of raw_input and input from Python 3000.

For teaching purposes, many educators report that they like raw_input
(and input). The basic argument is that, for beginners, code like

     name = raw_input('Morbo demands your name! ')

is clearer and easier than using sys.stdin.readline().

Some fear that a big mistake is being made here. Others just fear
getting bogged down in EDU-SIG discussions. :-)

Any suggestions for how educators interested in the
educational/learning aspects of Python 3000 could more fruitfully
participate?

For instance, would there be interest in the inclusion of a standard
educational library, a la the Java ACM library
(http://www-cs-faculty.stanford.edu/~eroberts//jtf/index.html)?

Toby
-- 
Dr. Toby Donaldson
School of Computing Science
Simon Fraser University (Surrey)

From solipsis at pitrou.net  Sun Sep 10 21:57:56 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 10 Sep 2006 21:57:56 +0200
Subject: [Python-3000] content-based detection
In-Reply-To: <1cb725390609101130w73f7a12bs243d8d6548b7b2d8@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>
	<1cb725390609101130w73f7a12bs243d8d6548b7b2d8@mail.gmail.com>
Message-ID: <1157918276.4257.30.camel@fsol>


On Sunday 10 September 2006 at 11:30 -0700, Paul Prescod wrote:
> I don't mind your name of autotextfile but I think that your
> by_content argument defeats the goal of having a very simple API for
> quick and dirty stuff. If content detection is a good idea (usually
> right) then we should do it.

Using the system or locale default is trustworthy and reproducible.
Content-based detection is wilder, especially if the algorithm isn't
fully refined in the first Py3k releases.

> I can't see an argument for ever turning off the BOM detection. 

Perhaps, but having a subset of it still running behind your back after
you have disabled it is misleading.

Also, I think having BOM detection as the only test in content-based
detection would be uninteresting. The common use case for encoding
detection is to guess between one of the Unicode variants (mostly UTF-8
*with or without BOM*) and the non-Unicode encoding which is popular for
a given language (e.g. ISO-8859-15).

I doubt many people have to discriminate between UTF-16LE, UCS-4 and
UTF-8. Are there real cases like that for text files?

Regards

Antoine.



From david.nospam.hopwood at blueyonder.co.uk  Sun Sep 10 22:01:10 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sun, 10 Sep 2006 21:01:10 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <20060910111814.F8ED.JCARLSON@uci.edu>
References: <1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
	<20060910111814.F8ED.JCARLSON@uci.edu>
Message-ID: <45046F06.5090502@blueyonder.co.uk>

Josiah Carlson wrote:
> David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> 
>>Here is a very simple, reasonably (although not completely) safe, and much
>>more predictable guessing algorithm, based on a generalization of
>><http://www.w3.org/TR/REC-xml/#sec-guessing>:
>>
>>   Let A, B, C, and D be the first 4 bytes of the stream, or None if the
>>     corresponding byte is past end-of-stream.
>>
>>   Let other be any encoding which is to be used as a default if no specific
>>     UTF is detected.
>>
>>   if A == 0xEF and B == 0xBB and C == 0xBF: return UTF8
>>   if B == None: return other
>>   if A == 0 and B == 0 and D != None: return UTF32BE
>>   if C == 0 and D == 0: return UTF32LE
>>   if A == 0xFE and B == 0xFF: return UTF16BE
>>   if A == 0xFF and B == 0xFE: return UTF16LE
>>   if A != 0 and B != 0: return other
>>   if A == 0: return UTF16BE
>>   return UTF16LE
>>
>>This would normally be used with 'other' as the system encoding, as an alternative
>>to just assuming that the file is in the system encoding.
> 
> Using the XML guessing mechanism is fine, as long as you get it right. 
> A first pass with BOM detection and a second pass to "guess" based on
> content in the case that a BOM isn't detected seems to make sense.

... if you think that guessing based on content is a good idea -- I don't.
In any case, such guessing necessarily depends on the expected file format,
so it should be done by the application itself, or by a library that knows
more about the format.

If the encoding of a text stream were settable after it had been opened,
then it would be easy for anyone to implement whatever guessing algorithm
they needed, without having to write an encoding implementation or include
any other support for guessing in the I/O library itself.

(This also requires the ability to seek back to the beginning of the stream
after reading the data needed for the guess.)
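
A minimal sketch of that pattern, assuming the caller supplies the
guessing function:

import io

def open_guessed(filename, guess):
    raw = open(filename, "rb")
    head = raw.read(4)        # the data needed for the guess
    raw.seek(0)               # "push back" by rewinding
    return io.TextIOWrapper(raw, encoding=guess(head))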

> Note that the above algorithm returns UTF32BE for files beginning with
> 4 null bytes.

Yes. But such a thing probably isn't a text file at all -- in which case
there will be subsequent decoding errors when most of the code units are
not in the range 0 to 0x10FFFF.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Sun Sep 10 23:12:34 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Sun, 10 Sep 2006 22:12:34 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>	
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>	
	<1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
Message-ID: <45047FC2.40904@blueyonder.co.uk>

Paul Prescod wrote:
> Maybe the guessing algorithm should read the WHOLE FILE.

That wouldn't work for streams (e.g. stdin). The algorithm I gave
does work for streams, provided that they have a push-back buffer of
at least 4 bytes.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From jcarlson at uci.edu  Sun Sep 10 23:47:13 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 10 Sep 2006 14:47:13 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <45046F06.5090502@blueyonder.co.uk>
References: <20060910111814.F8ED.JCARLSON@uci.edu>
	<45046F06.5090502@blueyonder.co.uk>
Message-ID: <20060910143817.F8F9.JCARLSON@uci.edu>


David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Josiah Carlson wrote:
[snip]
> > Using the XML guessing mechanism is fine, as long as you get it right. 
> > A first pass with BOM detection and a second pass to "guess" based on
> > content in the case that a BOM isn't detected seems to make sense.
> 
> ... if you think that guessing based on content is a good idea -- I don't.
> In any case, such guessing necessarily depends on the expected file format,
> so it should be done by the application itself, or by a library that knows
> more about the format.

I'm keeping my hat out of the ring for whether guessing is a good idea. 
However, if one is going to have a guessing mechanism, starting with UTF
BOMs is a good start, which is what I was trying to say.


> If the encoding of a text stream were settable after it had been opened,
> then it would be easy for anyone to implement whatever guessing algorithm
> they needed, without having to write an encoding implementation or include
> any other support for guessing in the I/O library itself.

That is true.  But the fact that you, presumably a programmer
experienced with Unicode, have provided an algorithm with an
obvious hole that I was able to discover in a few moments suggests that
guessing algorithms are not easy to write.

> (This also requires the ability to seek back to the beginning of the stream
> after reading the data needed for the guess.)
> 
> > Note that the above algorithm returns UTF32BE for files beginning with
> > 4 null bytes.
> 
> Yes. But such a thing probably isn't a text file at all -- in which case
> there will be subsequent decoding errors when most of the code units are
> not in the range 0 to 0x10FFFF.

A file starting with 4 nulls most likely implies a non-text file
of some kind, but presuming that "most" code points would not be in the
0...0x10FFFF range is a bit of an assumption about the content of a file. I
thought you didn't want to guess.


 - Josiah


From paul at prescod.net  Mon Sep 11 06:09:47 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 21:09:47 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1157917008.4257.8.camel@fsol>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
	<1157917008.4257.8.camel@fsol>
Message-ID: <1cb725390609102109j6e4c1e7bk18087b5319928abf@mail.gmail.com>

On 9/10/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
> ...
> > Modern I/O is astonishingly fast anyhow. On my computer it takes five
> > seconds to decode a quarter gigabyte of UTF-8 text through Python.
>
> Maybe we shouldn't be that presumptuous. Modern I/O is fast but memory
> is not infinite. That quarter gigabyte will have swapped out other
> data/code in order to make room in the filesystem cache.

Not really. It works in 16k chunks.

> Also, Python is often used on more modest hardware.

People writing programs to deal with vast amounts of data on modest
computers are trying to do something advanced and should not use the
quick and dirty guessing algorithms. We're not trying to hit 100% of
programmers and situations. Not even close. The PEP was very explicit
about that fact.

 Paul Prescod

From paul at prescod.net  Mon Sep 11 06:11:11 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 21:11:11 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <45047FC2.40904@blueyonder.co.uk>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
	<45047FC2.40904@blueyonder.co.uk>
Message-ID: <1cb725390609102111u44287761i8b509729aa6f5ce1@mail.gmail.com>

The PEP doesn't deal with streams. It is about files.

On 9/10/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Paul Prescod wrote:
> > Maybe the guessing algorithm should read the WHOLE FILE.
>
> That wouldn't work for streams (e.g. stdin). The algorithm I gave
> does work for streams, provided that they have a push-back buffer of
> at least 4 bytes.
>
> --
> David Hopwood <david.nospam.hopwood at blueyonder.co.uk>
>
>

From paul at prescod.net  Mon Sep 11 06:31:00 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 21:31:00 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <45046F06.5090502@blueyonder.co.uk>
References: <1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
	<20060910111814.F8ED.JCARLSON@uci.edu>
	<45046F06.5090502@blueyonder.co.uk>
Message-ID: <1cb725390609102131h59b75866ha66395bf55181da@mail.gmail.com>

On 9/10/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Josiah Carlson wrote:
> ... if you think that guessing based on content is a good idea -- I don't.
> In any case, such guessing necessarily depends on the expected file format,
> so it should be done by the application itself, or by a library that knows
> more about the format.

I disagree. If a non-trivial file can be decoded as a UTF-* encoding
it probably is that encoding. I don't see how it matters whether the
file represents LaTeX or an .htaccess file. XML is a special case
because it is specially designed to make encoding detection (not
guessing, but detection) easy.

> If the encoding of a text stream were settable after it had been opened,
> then it would be easy for anyone to implement whatever guessing algorithm
> they needed, without having to write an encoding implementation or include
> any other support for guessing in the I/O library itself.

But this defeats the whole purpose of the PEP which is to accelerate
the writing of quick and dirty text processing scripts.

 Paul Prescod

From paul at prescod.net  Mon Sep 11 06:42:01 2006
From: paul at prescod.net (Paul Prescod)
Date: Sun, 10 Sep 2006 21:42:01 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <87d5a366xd.fsf@qrnik.zagroda>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<87d5a366xd.fsf@qrnik.zagroda>
Message-ID: <1cb725390609102142m3a1dd33ha8400ec8e2005ba@mail.gmail.com>

On 9/10/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
>...
> Other than that, guessing the encoding from the contents of the text
> stream, especially statistical guessing based on well-formed UTF-8
> non-ASCII characters, shouldn't be encouraged, because its effect is
> not predictable.

My thinking has evolved. The "guess" mode should "virtually" try
different decodings until one succeeds. In the worst case this might
involve decoding the whole file twice (once for detection and once for
application processing).
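
Roughly, a sketch of the mechanism (the function name and the
candidate list are only illustrative here, not what the PEP would
specify):

    def guess_encoding(data, candidates=('utf-8', 'utf-16', 'latin-1')):
        # try each candidate against the WHOLE byte string; the first
        # one that decodes cleanly wins
        for enc in candidates:
            try:
                data.decode(enc)    # detection pass; the application
                return enc          # then decodes a second time
            except UnicodeError:
                continue
        raise UnicodeError('no candidate encoding fits')

    # Note that 'latin-1' never fails, so placed last it acts as a
    # catch-all; the order of the candidates is significant.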

In general, your proposal is too far from the goals that were given to
me by Guido for me to really evaluate it as an alternative. Guido's
goal was that quick and dirty text processing should "just work" for
newbies and encoding-disinterested expert programmers. I don't think
that your proposal achieves that.

 Paul Prescod

From jeff at soft.fujitsu.com  Mon Sep 11 06:54:03 2006
From: jeff at soft.fujitsu.com (Jeff Wilcox)
Date: Mon, 11 Sep 2006 13:54:03 +0900
Subject: [Python-3000] Help on text editors
In-Reply-To: <1cb725390609091058j49ffcdc6h61ce7eb80700f011@mail.gmail.com>
Message-ID: <LEEEILEJNIIMMBJHAKAPCEKFCCAA.jeff@soft.fujitsu.com>

> Great: but what is the default Textedit encoding on a Japanized version of
> the Mac?
>  Paul Prescod

I'm fairly sure that the settings on the computer I looked at this on are
default, but I borrowed the machine so I can't guarantee it.
In TextEdit with OS X set to Japanese there were three choices of encoding:
EUC-JP, ISO-2022-JP and Shift_JIS.  The dropdown defaulted to Shift_JIS.

The (reversible) procedure that I used to change the language back and forth
is:

System Preferences > International > Language
Drag the language you wish to use to the top of the list. Log out, then back
in again and it should be in the language you chose.
If only one language is listed, then the language pack(s) are most likely
not installed. They can be installed from the original OS X install CD/DVD.



From walter at livinglogic.de  Mon Sep 11 12:00:38 2006
From: walter at livinglogic.de (Walter =?iso-8859-1?Q?D=F6rwald?=)
Date: Mon, 11 Sep 2006 12:00:38 +0200 (CEST)
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157886177.4246.59.camel@fsol>
	<1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>
Message-ID: <61353.89.54.51.133.1157968838.squirrel@isar.livinglogic.de>

Paul Prescod wrote:

> I went based on the current setdefaultencoding. But it seems that we will
> accumulate 3 or 4 related functions so I'm persuaded that there should be a
> module.
>
> encodingdetection.setdefaultfileencoding
> encodingdetection.registerencodingdetector
> encodingdetection.guessfileencoding(filename)
> encodingdetection.guessfileencoding(bytestream)
>
> Suggestion accepted.

There's no need for implementing a separate infrastructure for encoding detection. This can be implemented as a "meta codec".
See http://styx.livinglogic.de/~walter/xml_codec/xml_codec.py for a codec that autodetects the XML encoding.
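
As a stripped-down sketch of the approach (this one only sniffs BOMs,
unlike xml_codec, which inspects the XML declaration itself; the codec
name "autodetect" is made up):

    import codecs

    def _sniff(data):
        # check the longer UTF-32 BOMs before UTF-16 (shared prefix)
        for bom, name in ((codecs.BOM_UTF32_LE, 'utf-32'),
                          (codecs.BOM_UTF32_BE, 'utf-32'),
                          (codecs.BOM_UTF8, 'utf-8-sig'),
                          (codecs.BOM_UTF16_LE, 'utf-16'),
                          (codecs.BOM_UTF16_BE, 'utf-16')):
            if data.startswith(bom):
                return name
        return 'utf-8'

    def _search(name):
        if name != 'autodetect':
            return None
        def decode(data, errors='strict'):
            data = bytes(data)
            return codecs.decode(data, _sniff(data), errors), len(data)
        utf8 = codecs.lookup('utf-8')
        return codecs.CodecInfo(encode=utf8.encode, decode=decode,
                                name='autodetect')

    codecs.register(_search)
    # b'\xef\xbb\xbfhello'.decode('autodetect')  ->  'hello'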

Servus,
   Walter




From phd at phd.pp.ru  Mon Sep 11 12:30:31 2006
From: phd at phd.pp.ru (Oleg Broytmann)
Date: Mon, 11 Sep 2006 14:30:31 +0400
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
Message-ID: <20060911103031.GD29600@phd.pp.ru>

On Sun, Sep 10, 2006 at 12:02:44PM -0700, Paul Prescod wrote:
> * Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving as"
> UTF-8

   Finally I've got the definitive answer for "is Russia Europe or Asia?"
It is an Eastern country! At last! ;)

> Maybe the guessing algorithm should read the WHOLE FILE.

   Zen: "In the face of ambiguity, refuse the temptation to guess."

   Unfortunately this contradicts not just the question of how much to read,
but the whole idea of guessing the encoding. So maybe we are going in the
wrong direction. IMHO the right direction is to include a guessing script
in the Tools directory.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From qrczak at knm.org.pl  Mon Sep 11 12:38:49 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Mon, 11 Sep 2006 12:38:49 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609102142m3a1dd33ha8400ec8e2005ba@mail.gmail.com> (Paul
	Prescod's message of "Sun, 10 Sep 2006 21:42:01 -0700")
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<87d5a366xd.fsf@qrnik.zagroda>
	<1cb725390609102142m3a1dd33ha8400ec8e2005ba@mail.gmail.com>
Message-ID: <8764fu3cxy.fsf@qrnik.zagroda>

"Paul Prescod" <paul at prescod.net> writes:

> Guido's goal was that quick and dirty text processing should "just
> work" for newbies and encoding-disintererested expert programmers.

What does 'guess' mean for creating files?

Consider a program which reads one file and writes data extracted
from it (e.g. with lines beginning with '#' removed) to another file.

With my proposal it will work if the encoding of the file is the same
as the locale encoding (or if they can be harmlessly confused).
It will just work most of the time.

It will not work in general if the encodings are different. In this
case the user of the script can override the encoding assumption
by temporarily changing the locale or by changing an environment
variable.



OTOH when the encoding is guessed from file contents, what happens
depends on how it's designed. If the locale is ISO-8859-x:

1. Files are created in the locale encoding.

   Then some UTF-8 files will be silently recoded to a different
   encoding, and for other UTF-8 files writing will fail (if they
   contain characters not expressible in the locale encoding).

2. Files are created in UTF-8.

   Then files encoded with the locale encoding will be silently
   recoded to UTF-8, causing trouble for further work with the file
   (it can't be even typed to the terminal).

If the locale is UTF-8, but the reader assumes e.g. ISO-8859-1 when
it can't decode as UTF-8, there will be a silent recoding for these
files. If the file is in fact encoded in ISO-8859-2, the result will
be nonsensical: looking like UTF-8 but with characters substituted
according to ISO-8859-2/1 differences.

In either case it's not clear what the user of the script can do
to preserve the encoding in the output file.

I claim that in my design the result is more easily predictable
and easier to fix when it goes wrong.



I've implemented a hack which allows simple programs to "just work" in
case of UTF-8. It's a modified encoder/decoder which escapes malformed
UTF-8 sequences with '\0' bytes, and thus allows arbitrary byte
sequences to round-trip UTF-8 decoding and encoding. It's not used by
default and it's never used when "UTF-8" is specified explicitly,
because it's not the true UTF-8, but I have an environment variable
which says "if the locale is UTF-8, use the modified UTF-8 as the
default encoding".

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From barry at python.org  Mon Sep 11 13:35:51 2006
From: barry at python.org (Barry Warsaw)
Date: Mon, 11 Sep 2006 07:35:51 -0400
Subject: [Python-3000] iostack, second revision
In-Reply-To: <45045105.5040209@jmunch.dk>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>	<4501A1B1.5050707@gmail.com>	<200609081403.18350.fdrake@acm.org>	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>	<1157740873.4979.10.camel@fsol>	<fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
	<450254DB.3020502@gmail.com> <45045105.5040209@jmunch.dk>
Message-ID: <0D784B1A-DB20-4D27-A11E-4AED4B76152B@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sep 10, 2006, at 1:53 PM, Anders J. Munch wrote:

> I say drop seek_cur and seek_end altogether, and keep only absolute
> seek.

I was just looking through some of our elf/dwarf parsing code and we  
use seek_cur quite a bit.  Not that it couldn't be rewritten to use  
absolute seek, but it's also not the most natural interface.  I'd opt  
for keeping those interfaces for binary files since there are use- 
cases where they are useful.

- -Barry


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)

iQCVAwUBRQVKGHEjvBPtnXfVAQL0GwP+KG8NflbbSoUxHLIkCyMd+NFj2fR1GAU5
dfu7cIc/oJpx25VxqgcDqM3IdKqp5CyJLG7AjtPXm8SuWGba3YmunHAcvnPPmP6Z
qdxAI8KD+Sf/imEuB7te29AUGlFteh+6IGKJKBMjxiXSjjqw2lwhDQphyhVPKuHp
3j+oly6uZ8E=
=/1N6
-----END PGP SIGNATURE-----

From paul at prescod.net  Mon Sep 11 15:58:42 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 11 Sep 2006 06:58:42 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <20060911103031.GD29600@phd.pp.ru>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol> <450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
	<20060911103031.GD29600@phd.pp.ru>
Message-ID: <1cb725390609110658t40cbafc8q66f93b1b03b7eabc@mail.gmail.com>

On 9/11/06, Oleg Broytmann <phd at phd.pp.ru> wrote:
>
> On Sun, Sep 10, 2006 at 12:02:44PM -0700, Paul Prescod wrote:
> > * Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving
> as"
> > UTF-8
>
>    Finally I've got the definitive answer for "is Russia Europe or Asia?"
> It is an Eastern country! At last! ;)


For these purposes, Russia is European, isn't it? Russian text can be
subsumed by UTF-8 with relatively minor expansion, right? If so, then I
would guess that UTF-8 would replace KOI8-R and iso8859-? for Russian
eventually.

> Maybe the guessing algorithm should read the WHOLE FILE.
>
>    Zen: "In the face of ambiguity, refuse the temptation to guess."
>
>    Unfortunately this contradicts not just the question of how much to read,
> but the whole idea of guessing the encoding. So maybe we are going in the
> wrong direction. IMHO the right direction is to include a guessing script
> in the Tools directory.


That was the position I started with. Guido wanted a guessing mode. So I
designed what seemed to me to be the least dangerous guessing mode possible:

 1. Off by default.
 2. Turned on by the keyword "guess".
 3. Decodes the full text to check for encoding correctness.

Given these safeguards, I think that the feature is not only safe enough but
also helpful.

Moving it to a script would not meet the central goal that it be easily
usable by people who do not know much about encodings or Python.

 Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060911/9b38a31b/attachment.html 

From paul at prescod.net  Mon Sep 11 16:15:07 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 11 Sep 2006 07:15:07 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <8764fu3cxy.fsf@qrnik.zagroda>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<87d5a366xd.fsf@qrnik.zagroda>
	<1cb725390609102142m3a1dd33ha8400ec8e2005ba@mail.gmail.com>
	<8764fu3cxy.fsf@qrnik.zagroda>
Message-ID: <1cb725390609110715t4caca46bya5f9b2508d216c7@mail.gmail.com>

On 9/11/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
>
> "Paul Prescod" <paul at prescod.net> writes:
>
> > Guido's goal was that quick and dirty text processing should "just
> > work" for newbies and encoding-disintererested expert programmers.
>
> What does 'guess' mean for creating files?


I wasn't sure about this one. But on Windows and Mac it seems safe to
generate UTF-8-with-BOM. Textedit, VIM and notepad all auto-detect the UTF-8
BOM and do the right thing.

> 2. Files are created in UTF-8.
>
>    Then files encoded with the locale encoding will be silently
>    recoded to UTF-8, causing trouble for further work with the file
>    (it can't be even typed to the terminal).


It can on the terminal on the Mac. And on the increasing number of
UTF-8-defaulted Linux distributions. Perhaps it should by default use the
Unix locale for output, but only on Unix and not on Mac/Windows.

> I've implemented a hack which allows simple programs to "just work" in
> case of UTF-8. It's a modified encoder/decoder which escapes malformed
> UTF-8 sequences with '\0' bytes, and thus allows arbitrary byte
> sequences to round-trip UTF-8 decoding and encoding. It's not used by
> default and it's never used when "UTF-8" is specified explicitly,
> because it's not the true UTF-8, but I have an environment variable
> which says "if the locale is UTF-8, use the modified UTF-8 as the
> default encoding".


That's an interesting idea. I'm not sure if you are proposing it as being
applicable to this PEP or not...

 Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060911/f5e90075/attachment.htm 

From phd at phd.pp.ru  Mon Sep 11 16:23:04 2006
From: phd at phd.pp.ru (Oleg Broytmann)
Date: Mon, 11 Sep 2006 18:23:04 +0400
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609110658t40cbafc8q66f93b1b03b7eabc@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>
	<1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>
	<20060911103031.GD29600@phd.pp.ru>
	<1cb725390609110658t40cbafc8q66f93b1b03b7eabc@mail.gmail.com>
Message-ID: <20060911142303.GA12119@phd.pp.ru>

On Mon, Sep 11, 2006 at 06:58:42AM -0700, Paul Prescod wrote:
> For these purposes, Russia is European, isn't it?

   If the test is "a BOM in UTF-8 text files on Unices" - then no. :)

> Russian text can be subsumed by UTF-8 with relatively minor expansion, right?

   Sorry, what do you mean? That Russian encodings can be converted to
UTF-8? Yes, they can. But the most popular encoding here is cp1251, not
UTF-8. Even on Unices there are people who use cp1251 as their main
encoding (locale, fonts, keyboard mapping) because they often switch
between a number of platforms.

> If so, then I
> would guess that UTF-8 would replace KOI8-R and iso8859-? for Russian
> eventually.

   On Unix? Probably yes, but not in the near future. There are some
popular tools (for me the most notable is Midnight Commander) that still
have problems with UTF-8 locales.

> Given these safeguards, I think that the feature is not only safe enough but
> also helpful.

   Ok then.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From mcherm at mcherm.com  Mon Sep 11 18:44:47 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Mon, 11 Sep 2006 09:44:47 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
Message-ID: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>

Paul Prescod writes:
     [... Pre-PEP proposal ...]

Quick thoughts:

  * I like it. Good work.
  * I agree with Guido: "open" is the right spelling for this.
  * I agree with Paul: mandatory specification is the way to go.  
10,000 different blog entries, tutorials, and cookbook recipes can  
recommend using "guess" if you don't know what you're doing. Or they  
can all recommend using "site". Or they can all recommend using "utf-8".  
I'm not sure what they'll all recommend, and that's enough reason for  
me to require the user to say. If we later decide that one is an  
acceptable default, we could make it optional in release 3.1, 3.2, or  
3.3... but if we make it optional from the start then it can never  
become required.

Other thoughts after reading everyone else's replies:

  * Guessing. Hmm. Broad topic. If the option for guessing were  
spelled "guess" (rather than, say "autodetect") then I would have been  
scared off from using it in "production code" but I would still feel  
free to use it in quick-and-dirty scripting. On the other hand, I'm not  
sure I'm a good "typical programmer". Fortunately, your PEP works fine  
whether or not "guess" is allowed, so I can support your PEP without  
having to commit on the idea of having a "guess" option.

-- Michael Chermside


From mcherm at mcherm.com  Mon Sep 11 20:22:15 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Mon, 11 Sep 2006 11:22:15 -0700
Subject: [Python-3000] educational aspects of Python 3000
Message-ID: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>

Toby Donaldson writes:
> Any suggestions for how educators interested in the
> educational/learning aspects of Python 3000 could more fruitfully
> participate?

You're doing pretty well so far! Seriously... just speak up: Pythonistas
(including, in particular, Guido) value the fact that Python is an
excellent language for beginners, and we'll go out of our way to keep
it so. But you might need to speak up.

Elsewhere:
> For teaching purposes, many educators report that they like raw_input
> (and input). The basic argument is that, for beginners, code like
>
>      name = raw_input('Morbo demands your name! ')
>
> is clearer and easier than using sys.stdin.readline().
       [...]
> For instance, would there be interest in the inclusion of a standard
> educational library...

Personally, I think input() should never have existed and must go
no matter what. I think raw_input() is worth discussing -- I wouldn't
need it, but it's little more than a convenience function.

The idea of a standard edu library though is a GREAT one. That would
provide a standard place for things like raw_input() (with a better
name) as well as lots of other "helper functions" useful to beginners
and/or students -- and all it would cost is a single line of boilerplate
at the top of each program ("from beginnerlib import *" or something
like that).

I suspect that such a library would be enthusiastically welcomed into
the Python core distribution *IF* there was clear consensus about
what it should contain. So if the EDU-SIG could do the hard work of
obtaining the consensus (and mark my words... it IS hard work), I
think you'd be 90% of the way there.

-- Michael Chermside


From brett at python.org  Mon Sep 11 20:26:51 2006
From: brett at python.org (Brett Cannon)
Date: Mon, 11 Sep 2006 11:26:51 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
Message-ID: <bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>

On 9/11/06, Michael Chermside <mcherm at mcherm.com> wrote:
>
> Toby Donaldson writes:
> > Any suggestions for how educators interested in the
> > educational/learning aspects of Python 3000 could more fruitfully
> > participate?
>
> You're doing pretty well so far! Seriously... just speak up: Pythonistas
> (including, in particular, Guido) value the fact that Python is an
> excellent language for beginners, and we'll go out of our way to keep
> it so. But you might need to speak up.
>
> Elsewhere:
> > For teaching purposes, many educators report that they like raw_input
> > (and input). The basic argument is that, for beginners, code like
> >
> >      name = raw_input('Morbo demands your name! ')
> >
> > is clearer and easier than using sys.stdin.readline().
>        [...]
> > For instance, would there be interest in the inclusion of a standard
> > educational library...
>
> Personally, I think input() should never have existed and must go
> no matter what.


Agreed.  Teach the folks eval() quick if you want something like that.

> I think raw_input() is worth discussing -- I wouldn't
> need it, but it's little more than a convenience function.


Yeah, but when you are learning it's cool to take input easily.  I loved
raw_input() when I started out.

> The idea of a standard edu library though is a GREAT one. That would
> provide a standard place for things like raw_input() (with a better
> name) as well as lots of other "helper functions" useful to beginners
> and/or students -- and all it would cost is a single line of boilerplate
> at the top of each program ("from beginnerlib import *" or something
> like that).
>
> I suspect that such a library would be enthusiastically welcomed into
> the Python core distribution *IF* there was clear consensus about
> what it should contain. So if the EDU-SIG could do the hard work of
> obtaining the consensus (and mark my words... it IS hard work), I
> think you'd be 90% of the way there.


Yeah.  Stuff that normally trips up beginners could be put in here with
pointers to how to do it properly when they get more advanced.  And making
the name seem very newbie will (hopefully) discourage people from using it
beyond their learning code.

-Brett
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060911/5656a63b/attachment.html 

From solipsis at pitrou.net  Mon Sep 11 20:42:46 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Mon, 11 Sep 2006 20:42:46 +0200
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
Message-ID: <1158000166.4672.33.camel@fsol>


On Monday 11 September 2006 at 11:22 -0700, Michael Chermside wrote:
> The idea of a standard edu library though is a GREAT one. That would
> provide a standard place for things like raw_input() (with a better
> name) as well as lots of other "helper functions" useful to beginners
> and/or students -- and all it would cost is a single line of boilerplate
> at the top of each program ("from beginnerlib import *" or something
> like that).

There is a risk with a beginner-specific library: it's the same problem as
with user interfaces which have "simple" and "advanced" modes. Often
the "simple" mode becomes an excuse for lazy developers to turn the
"advanced" mode into a painful mess (under the flawed pretext that
advanced users can suffer the pain anyway).

And if the helper functions are genuinely useful, why would they be only
for beginners and students?

IMHO, it would be better to label the module "scripting" rather than
"beginnerlib" (and why append "lib" at the end of module names
anyway? :-)).
It might even contain stuff such as encoding guessing.

>>> from scripting import raw_input, autotextfile

Regards

Antoine.



From p.f.moore at gmail.com  Mon Sep 11 23:49:34 2006
From: p.f.moore at gmail.com (Paul Moore)
Date: Mon, 11 Sep 2006 22:49:34 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
Message-ID: <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>

On 9/11/06, Michael Chermside <mcherm at mcherm.com> wrote:
> Paul Prescod writes:
>      [... Pre-PEP proposal ...]
>
> Quick thoughts:

My quick thoughts on this whole subject:

* Yes, it should be "open". Anything else feels like gratuitous breakage.
* There should be a default encoding, and it should be the system
default one. If I don't take special steps, most tools I use save in
the system default encoding, so Python should follow this approach as
well.
* I don't mind corrupted characters for unusual cases. Really, I don't.
* The bizarre Windows behaviour of using different encodings for
console and GUI programs doesn't bother me either. Really. I promise.

99.99% of the time I simply don't care about i18n. All I want is
something that runs on the machine(s) I'm using. Using the system
locale is fine for that.

In the rare cases where I *do* care about international characters, I
have no problem doing work and research to get things right. And when
I've done that, detecting encodings and specifying the right thing in
an open() call is entirely OK.

Of course, I'm in the useful position of having an OS default
character set which contains ASCII as a subset. I don't know what
issues someone with Greek/Russian/Japanese or whatever as an OS
default would have (one thought - if your default character set
doesn't contain ASCII as a subset, how do you deal with the hosts
file? OTOH, I had a real struggle to find an example of an encoding
which didn't have ASCII as a subset!)

Paul.

From jimjjewett at gmail.com  Mon Sep 11 23:53:40 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 11 Sep 2006 17:53:40 -0400
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157886177.4246.59.camel@fsol>
	<1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>
Message-ID: <fb6fbf560609111453t7509985ayff501ff4666189a0@mail.gmail.com>

On 9/10/06, Paul Prescod <paul at prescod.net> wrote:

> encodingdetection.setdefaultfileencoding
> encodingdetection.registerencodingdetector
> encodingdetection.guessfileencoding(filename)
> encodingdetection.guessfileencoding(bytestream)

This demonstrates two of the problems with requiring an explicit decision.

(1)  You still won't actually get one; you'll just lose the
information that it wasn't even considered.

(2)  You'll add so much boilerplate that you invite other bugs.


Suddenly,

    >>> f=open("runlist.txt")

turns into something more like

    >>> import encodingdetection
    ...
    >>> f=open("runlist.txt",
encoding=encodingdetection.guessfileencoding("runlast.txt"))

I certainly wouldn't read a line like that without a good reason; I
wouldn't even notice that the encoding guess was based on a different
file.

It will be an annoying amount of typing though, during which time I'll
be thinking:

"It doesn't really matter what encoding is used; if there is anything
outside of ASCII, it is because the user put it there, and all I have
to do is copy it around unchanged."

For situations like that, if there were *ever* a reason to specify a
particular encoding, I *still* wouldn't get it right, because it is
something that hasn't occurred to me.  I guess the explicitness means
that the error is now my fault instead of python's, but the error is
still there, and someone else is more reluctant to fix it.  (Well,
this *was* an explicit choice -- maybe I had a reason?)

But since the encoding is mandatory, I do still have to deal with it,
by making my code longer and uglier.  In the end, packages will end up
distributing their own non-standard convenience wrappers, so that the
equivalent of

>>> f=open("runlist.txt")

can still be used -- but you'll have to read the whole module and the
imports (and the star-imports) to figure out what it means/whether it
is shadowed.  You can't even scan for "open" because someone may have
named their convenience wrapper get_file.  If packages A and B
disagree about the default encoding, it will be even harder to find
and fix than it is today.

-jJ

From paul at prescod.net  Tue Sep 12 00:09:02 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 11 Sep 2006 15:09:02 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <fb6fbf560609111453t7509985ayff501ff4666189a0@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<1157886177.4246.59.camel@fsol>
	<1cb725390609101114i60f2c8bei5c67885dff7b3f0a@mail.gmail.com>
	<fb6fbf560609111453t7509985ayff501ff4666189a0@mail.gmail.com>
Message-ID: <1cb725390609111509u39725d5al72294b0d89009eb6@mail.gmail.com>

I think that the basis of your concern is a misunderstanding of the
proposal (at least as documented in the PEP).

On 9/11/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 9/10/06, Paul Prescod <paul at prescod.net> wrote:
>
> > encodingdetection.setdefaultfileencoding
> > encodingdetection.registerencodingdetector
> > encodingdetection.guessfileencoding(filename)
> > encodingdetection.guessfileencoding(bytestream)

Those last two are helper functions exposing the functionality of the
"guess" keyword through a different means.

> This demonstrates two of problems with requiring an explicit decision.
>
> (1)  You still won't actually get one; you'll just lose the
> information that it wasn't even considered.

I frankly don't think that that makes any sense. If there is a default,
then how can I know whether someone thought about it and decided to
use the default, or did not think it through and decided to use the
default?

> (2)  You'll add so much boilerplate that you invite other bugs.
>
>
> Suddenly,
>
>     >>> f=open("runlist.txt")
>
> turns into something more like
>
>     >>> import encodingdetection
>     ...
>     >>> f=open("runlist.txt",
> encoding=encodingdetection.guessfileencoding("runlast.txt"))

No, that was never the proposal. The proposal is:

f = open("runlist.txt", "guess")

> "It doesn't really matter what encoding is used; if there is anything
> outside of ASCII, it is because the user put it there, and all I have
> to do is copy it around unchanged."

Yes, if you are doing something utterly trivial with the text as
opposed to the normal case where you are comparing it with some other
input, combining it with some other input, putting it in a database,
serving it up over the Web etc. Even Unix "cat" would need to be
encoding aware if it were created today and designed to be i18n
friendly.

> For situations like that, if there were *ever* a reason to specify a
> particular encoding, I *still* wouldn't get it right, because it is
> something that hasn't occurred to me. I guess the explicitness means
> that the error is now my fault instead of python's, but the error is
> still there, and someone else is more reluctant to fix it.  (Well,
> this *was* an explicit choice -- maybe I had a reason?)

The documentation for the "guess" keyword will be clear that it is
NEVER the correct choice for production-quality software. That's one
of the virtues of having an explicit keyword for the quick and dirty
mode (as opposed to making it the default as you seem to wish).

> But since the encoding is mandatory, I do still have to deal with it,
> by making my code longer and uglier.  In the end, packages will end up
> distributing their own non-standard convenience wrappers, so that the
> equivalent of
>
> >>> f=open("runlist.txt")

No, I don't think they'll do that to avoid typing 7 extra characters.

 Paul Prescod

From guido at python.org  Tue Sep 12 00:18:32 2006
From: guido at python.org (Guido van Rossum)
Date: Mon, 11 Sep 2006 15:18:32 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <1158000166.4672.33.camel@fsol>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<1158000166.4672.33.camel@fsol>
Message-ID: <ca471dc20609111518o73e119b5p49a5b9c9c5e9a09b@mail.gmail.com>

On 9/11/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Monday 11 September 2006 at 11:22 -0700, Michael Chermside wrote:
> > The idea of a standard edu library though is a GREAT one. That would
> > provide a standard place for things like raw_input() (with a better
> > name) as well as lots of other "helper functions" useful to beginners
> > and/or students -- and all it would cost is a single line of boilerplate
> > at the top of each program ("from beginnerlib import *" or something
> > like that).
>
> There is a risk with a beginner-specific library: it's the same problem as
> with user interfaces which have "simple" and "advanced" modes. Often
> the "simple" mode becomes an excuse for lazy developers to turn the
> "advanced" mode into a painful mess (under the flawed pretext that
> advanced users can suffer the pain anyway).

Please give us more credit than that.

> And if the helper functions are genuinely useful, why would they be only
> for beginners and students?

DrScheme has several levels for beginners and experts and in between,
so they think it is really useful to have different levels. I'm torn;
I wish a single level would apply to all but I know that many
educators provide some kind of "training wheels" library for their
students.

> IMHO, it would be better to label the module "scripting" rather than
> "beginnerlib" (and why append "lib" at the end of module names
> anyway? :-)).
> It might even contain stuff such as encoding guessing.
>
> >>> from scripting import raw_input, autotextfile

I'm not so keen on 'scripting' as the name either, but I'm sure we can
come up with something. Perhaps easyio, simpleio or basicio? (Not to
be confused with vbio. :-)

I'm also not completely against revising the decision on killing
raw_input(). While input() must definitely go, raw_input() might
survive under a new name. Too bad calling it input() would be too
confusing from a Python 2.x POV, and I don't want to call it
readline() because it strips the trailing newline and raises EOF on
error. Unless the educators can live with having to use
readline().strip() instead of raw_input()...?

Perhaps the educators (less Art :-) can get together and write a PEP?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From paul at prescod.net  Tue Sep 12 00:30:15 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 11 Sep 2006 15:30:15 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
Message-ID: <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>

On 9/11/06, Paul Moore <p.f.moore at gmail.com> wrote:
> On 9/11/06, Michael Chermside <mcherm at mcherm.com> wrote:
> > Paul Prescod writes:
> >      [... Pre-PEP proposal ...]
> >
> > Quick thoughts:
>
> My quick thoughts on this whole subject:
>
> * Yes, it should be "open". Anything else feels like gratuitous breakage.
> * There should be a default encoding, and it should be the system
> default one. If I don't take special steps, most tools I use save in
> the system default encoding, so Python should follow this approach as
> well.

So just to be clear: you want to keep the function name "open" but
change its behaviour. For example, the ord() of high characters
returned by open will be completely different than today. And the
syntax for "open" of binary files will be different (in fact, whether
it reads the file or throws an exception will depend on your locale).

> The bizarre Windows behavious of using different
> encodings for console and GUI programs doesn't
> bother me either. Really. I promise.

So according to this philosophy, Windows and Mac users will probably
never be able to open UTF-8 documents by default even if every
Microsoft app generates and consumes UTF-8 by default, because
Microsoft and Apple will probably _never change the default locale_
for backwards compatibility reasons. Their philosophy seems to be that
the locale is irrelevant in the age of Unicode and therefore there is
no reason to upgrade it at the risk of "breaking" applications that were
hard-coded to expect a specific locale.

 Paul Prescod

From qrczak at knm.org.pl  Tue Sep 12 01:20:28 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 12 Sep 2006 01:20:28 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	(Paul Moore's message of "Mon, 11 Sep 2006 22:49:34 +0100")
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
Message-ID: <87d5a2f0sj.fsf@qrnik.zagroda>

"Paul Moore" <p.f.moore at gmail.com> writes:

> Of course, I'm in the useful position of having an OS default
> character set which contains ASCII as a subset. I don't know what
> issues someone with Greek/Russian/Japanese or whatever as an OS
> default would have (one thought - if your default character set
> doesn't contain ASCII as a subset, how do you deal with the hosts
> file? OTOH, I had a real struggle to find an example of an encoding
> which didn't have ASCII as a subset!)

AFAIK the only encoding which might be used today which is not based
on ASCII is EBCDIC. Perl supports it (and it supports Unicode at the
same time, via UTF-EBCDIC).

Other than that, there are some Japanese encodings with a confusion
between \ and the Yen sign, otherwise being ASCII. They are used today.

There used to be national ASCII variants with accented letters instead
of [\]^{|}~. I don't think they are still used today.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From david.nospam.hopwood at blueyonder.co.uk  Tue Sep 12 01:25:15 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Tue, 12 Sep 2006 00:25:15 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609102131h59b75866ha66395bf55181da@mail.gmail.com>
References: <1157892435.4246.107.camel@fsol>	
	<450418AC.2010400@blueyonder.co.uk>	
	<20060910111814.F8ED.JCARLSON@uci.edu>	
	<45046F06.5090502@blueyonder.co.uk>
	<1cb725390609102131h59b75866ha66395bf55181da@mail.gmail.com>
Message-ID: <4505F05B.8070503@blueyonder.co.uk>

Paul Prescod wrote:
> On 9/10/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> 
>> ... if you think that guessing based on content is a good idea -- I
>> don't. In any case, such guessing necessarily depends on the expected file
>> format, so it should be done by the application itself, or by a library that
>> knows more about the format.
> 
> I disagree. If a non-trivial file can be decoded as a UTF-* encoding
> it probably is that encoding.

That is quite false for UTF-16, at least. It is also false for short UTF-8
files.

> I don't see how it matters whether the
> file represents Latex or an .htaccess file. XML is a special case
> because it is specially designed to make encoding detection (not
> guessing, but detection) easy.

Many other frequently used formats also necessarily start with an ASCII
character and do not contain NULs, which is at least sufficient to reliably
detect UTF-16 and UTF-32.
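
For formats with that property the check is mechanical; a sketch (the
helper name is made up):

    def sniff_utf16_32(head):
        # head: the first four bytes of a file known to start with an
        # ASCII character if it is in an ASCII-superset encoding
        if len(head) >= 4:
            if head[0] == 0 and head[1] == 0 and head[2] == 0 and head[3] != 0:
                return 'utf-32-be'   # 00 00 00 41 for 'A'
            if head[0] != 0 and head[1] == 0 and head[2] == 0 and head[3] == 0:
                return 'utf-32-le'   # 41 00 00 00
        if len(head) >= 2:
            if head[0] == 0 and head[1] != 0:
                return 'utf-16-be'   # 00 41
            if head[0] != 0 and head[1] == 0:
                return 'utf-16-le'   # 41 00
        return None                  # an ASCII superset (UTF-8, Latin-1, ...)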

>> If the encoding of a text stream were settable after it had been opened,
>> then it would be easy for anyone to implement whatever guessing algorithm
>> they needed, without having to write an encoding implementation or
>> include any other support for guessing in the I/O library itself.
> 
> But this defeats the whole purpose of the PEP which is to accelerate
> the writing of quick and dirty text processing scripts.

That doesn't justify making the behaviour of those scripts "dirtier" than
necessary.

I think that the focus should be on solving a set of well-defined problems,
for which BOM detection can definitely help:

Suppose we have a system in which some of the files are in a potentially
non-Unicode 'system' encoding, and some are Unicode. The user of the system
needs a reliable way of marking the Unicode files so that the encoding of
*those* files can be distinguished. In addition, a provider of portable
software or documentation needs a way to encode files for distribution that
is independent of the system encoding, since (before run-time) they don't
know what encoding that will be on any given system. Use and detection of
Byte Order Marks solves both of these problems.

You appear to be arguing for the common use of much more ambitious heuristic
guessing, which *cannot* be made reliable. I am not opposed to providing
support for such guessing in the Python standard library, but only if its
limitations are thoroughly documented, and only if it is not the default.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From david.nospam.hopwood at blueyonder.co.uk  Tue Sep 12 01:29:13 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Tue, 12 Sep 2006 00:29:13 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609102111u44287761i8b509729aa6f5ce1@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>	
	<1157884285.4246.41.camel@fsol> <4503FDC8.2030608@gmail.com>	
	<1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>	
	<1cb725390609101202x644fe40fm1e234b564890c8d5@mail.gmail.com>	
	<45047FC2.40904@blueyonder.co.uk>
	<1cb725390609102111u44287761i8b509729aa6f5ce1@mail.gmail.com>
Message-ID: <4505F149.8030509@blueyonder.co.uk>

Paul Prescod wrote:
> The PEP doesn't deal with streams. It is about files.

An important part of the Unix design philosophy (partially adopted by Windows)
is to make streams and files behave as similarly as possible. It is quite
feasible to make *some* detection algorithms work for streams, and this is
an advantage over algorithms that don't work for streams.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From qrczak at knm.org.pl  Tue Sep 12 01:44:13 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 12 Sep 2006 01:44:13 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	(Paul Prescod's message of "Mon, 11 Sep 2006 15:30:15 -0700")
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
Message-ID: <87r6yij7ea.fsf@qrnik.zagroda>

"Paul Prescod" <paul at prescod.net> writes:

>> The bizarre Windows behaviour of using different
>> encodings for console and GUI programs doesn't
>> bother me either. Really. I promise.
>
> So according to this philosophy, Windows and Mac users will probably
> never be able to open UTF-8 documents by default even if every
> Microsoft app generates and consumes UTF-8 by default, because
> Microsoft and Apple will probably _never change the default locale_
> for backwards compatibility reasons.

This can be solved for file reading by making a "Windows locale"
always consider UTF-8 BOM and switch to UTF-8 in this case.
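
A sketch (the helper name is made up):

    def effective_encoding(head, locale_encoding):
        # head: the first few bytes of the file
        if head.startswith(b'\xef\xbb\xbf'):    # UTF-8 BOM
            return 'utf-8-sig'                  # decodes and drops the BOM
        return locale_encoding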

It's still unclear what to do for writing on Windows.

I have no idea what the Mac does (does it typically use UTF-8 locales?
and does it typically use a BOM in UTF-8?).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Tue Sep 12 02:41:59 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 11 Sep 2006 17:41:59 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <4505F05B.8070503@blueyonder.co.uk>
References: <1157892435.4246.107.camel@fsol>
	<450418AC.2010400@blueyonder.co.uk>
	<20060910111814.F8ED.JCARLSON@uci.edu>
	<45046F06.5090502@blueyonder.co.uk>
	<1cb725390609102131h59b75866ha66395bf55181da@mail.gmail.com>
	<4505F05B.8070503@blueyonder.co.uk>
Message-ID: <1cb725390609111741v3b7ef92aufec5aa960711768c@mail.gmail.com>

On 9/11/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> > I disagree. If a non-trivial file can be decoded as a UTF-* encoding
> > it probably is that encoding.
>
> That is quite false for UTF-16, at least. It is also false for short UTF-8
> files.

True UTF-16 (as opposed to UTF-16 BE/UTF-16 LE) files have a BOM.
Also, you can recognize incorrect ones through misuse of surrogates.

> > I don't see how it matters whether the
> > file represents Latex or an .htaccess file. XML is a special case
> > because it is specially designed to make encoding detection (not
> > guessing, but detection) easy.
>
> Many other frequently used formats also necessarily start with an ASCII
> character and do not contain NULs, which is at least sufficient to reliably
> detect UTF-16 and UTF-32.

Yes, but these are the two easiest ones.

> > But this defeats the whole purpose of the PEP which is to accelerate
> > the writing of quick and dirty text processing scripts.
>
> That doesn't justify making the behaviour of those scripts "dirtier" than
> necessary.
>
> I think that the focus should be on solving a set of well-defined problems,
> for which BOM detection can definitely help:
>
> Suppose we have a system in which some of the files are in a potentially
> non-Unicode 'system' encoding, and some are Unicode. The user of the system
> needs a reliable way of marking the Unicode files so that the encoding of
> *those* files can be distinguished.

If the user understands the problem and is willing to go to this level
of effort then they are not the target user of the feature.

> ... In addition, a provider of portable
> software or documentation needs a way to encode files for distribution that
> is independent of the system encoding, since (before run-time) they don't
> know what encoding that will be on any given system. Use and detection of
> Byte Order Marks solves both of these problems.

Sure, that's great.

> You appear to be arguing for the common use of much more ambitious heuristic
> guessing, which *cannot* be made reliable.

First, the word "guess" necessarily implies unreliability. Guido
started this whole chain of discussion when he said:

"(Auto-detection from sniffing the data is a perfectly valid answer
BTW -- I see no reason why that couldn't be one option, as long as
there's a way to disable it.)"

> ... I am not opposed to providing
> support for such guessing in the Python standard library, but only if its
> limitations are thoroughly documented, and only if it is not the default.

Those are both characteristics of the proposal that started this
thread so what are we arguing about?

Since writing the PEP, I've noticed that the strategy of trying to
decode as UTF-* and falling back to an 8-bit character set is actually
pretty common in text editors, which implies that Python's behaviour
here can be highly similar to text editors. This was the key
requirement Guido gave me in an off-list email for the guessing mode.

VIM: "fileencodings: This is a list of character encodings considered
when starting to edit a file.  When a file is read, Vim tries to use
the first mentioned character encoding.  If an error is detected, the
next one in the list is tried.  When an encoding is found that works,
'fileencoding' is set to it.	"

Reading the docs, one can infer that this feature is specifically
designed to support UTF-8 sniffing. I would guess that the default
configuration has it do UTF-8 sniffing.
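
For reference, a typical value looks like

    set fileencodings=ucs-bom,utf-8,default,latin1

(I believe the exact default depends on the 'encoding' option): BOM
detection runs first, then a full UTF-8 decode is attempted, and latin1
is the catch-all because decoding as latin1 never fails.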

BBEdit: "If the file contains no other cues to indicate its text
encoding, and its contents appear to be valid UTF-8, BBEdit will open
it as UTF-8 (No BOM) without recourse to the preferences option."

 Paul Prescod

From paul at prescod.net  Tue Sep 12 03:16:15 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 11 Sep 2006 18:16:15 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <87r6yij7ea.fsf@qrnik.zagroda>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
Message-ID: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>

On 9/11/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> "Paul Prescod" <paul at prescod.net> writes:
>
> >> The bizarre Windows behaviour of using different
> >> encodings for console and GUI programs doesn't
> >> bother me either. Really. I promise.
> >
> > So according to this philosophy, Windows and Mac users will probably
> > never be able to open UTF-8 documents by default even if every
> > Microsoft app generates and consumes UTF-8 by default, because
> > Microsoft and Apple will probably _never change the default locale_
> > for backwards compatibility reasons.
>
> This can be solved for file reading by making a "Windows locale"
> always consider UTF-8 BOM and switch to UTF-8 in this case.

That's fine but I don't see why we would turn that feature off for any
platform. Do you have a bunch of files hanging around starting with
zero-width non-breaking spaces?

> It's still unclear what to do for writing on Windows.

UTF-8 with BOM is the Microsoft preferred format. Maybe after
experimentation we'll find that there are still apps out there that
choke on it, but we should start out trying to be compatible with
other apps on the platform.

> I have no idea what the Mac does (does it typically use UTF-8 locales?
> and does it typically use a BOM in UTF-8?).

Like Windows, the Mac has backwards-compatible behaviours in some
places (textedit defaults to a proprietary encoding called Mac Roman)
and UTF-8 behaviours in other places (e.g. cut and paste). In some
places (on my configuration) it claims its locale is US ASCII.

Textedit can read files with a BOM and auto-detect Unicode with a BOM.
It always saves without a BOM, which results in the unfortunate
situation that Textedit will recognize a file's encoding, then save
it, then forget its encoding when you reopen it. :(

But again, this implies that at least on these two platforms UTF-8
w/BOM is a good default output encoding.

On Unix, VIM is also set up to auto-detect UTF-8 (using the BOM or
a full decoding attempt). According to Google, XEmacs also has some
kind of UTF-8/BOM detector but I don't know the details. GNU Emacs:
According to "Emacs wiki": "Auto-detection of UTF-8 is effectively
disabled by default in GNU Emacs 21.3 and below."

So the situation on Unix is not as clear.

 Paul Prescod

From talin at acm.org  Tue Sep 12 04:48:39 2006
From: talin at acm.org (Talin)
Date: Mon, 11 Sep 2006 19:48:39 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <ca471dc20609111518o73e119b5p49a5b9c9c5e9a09b@mail.gmail.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>	<1158000166.4672.33.camel@fsol>
	<ca471dc20609111518o73e119b5p49a5b9c9c5e9a09b@mail.gmail.com>
Message-ID: <45062007.1040909@acm.org>

Guido van Rossum wrote:
>>>>> from scripting import raw_input, autotextfile
> 
> I'm not so keen on 'scripting' as the name either, but I'm sure we can
> come up with something. Perhaps easyio, simpleio or basicio? (Not to
> be confused with vbio. :-)
> 
> I'm also not completely against revising the decision on killing
> raw_input(). While input() must definitely go, raw_input() might
> survive under a new name. Too bad calling it input() would be too
> confusing from a Python 2.x POV, and I don't want to call it
> readline() because it strips the trailing newline and raises EOF on
> error. Unless the educators can line with having to use
> readline().strip() instead of raw_input()...?

How about calling it 'ask'?

 >>> s = ask( "How are you today?" )
--> Fine
 >>> s
"Fine"

And as far as the name of a library goes, how about "quickstart"? Other 
possibilities are: quickstudy, kickstart, simplestart, etc.

"With the Python quickstart module, programming is as easy as 
one...two...five!"

-- Talin


From greg.ewing at canterbury.ac.nz  Tue Sep 12 05:20:50 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 12 Sep 2006 15:20:50 +1200
Subject: [Python-3000] The future of exceptions
In-Reply-To: <874pvgozlo.fsf@qrnik.zagroda>
References: <20060908094550.zdd30l5x9tk0koow@login.werra.lunarpages.com>
	<4503AC79.4090601@canterbury.ac.nz> <874pvgozlo.fsf@qrnik.zagroda>
Message-ID: <45062792.3040207@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> It's lazily instantiated today (see PyErr_NormalizeException).

Only in C code, though, not Python. And if the
separate type/value specification when raising
goes away, it might not be possible any more
even in C.

> 'WithExit' constructs a unique exception object and catches precisely
> this object.

That would fill the bill, yes.

So it's really just a matter of making sure
we keep the ability to be as lazy as possible
with exception processing.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 12 05:34:42 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 12 Sep 2006 15:34:42 +1200
Subject: [Python-3000] sys.stdin and sys.stdout with textfile
In-Reply-To: <ca471dc20609101004t2d55b686x4908c39981467106@mail.gmail.com>
References: <1157898432.4246.161.camel@fsol>
	<ca471dc20609101004t2d55b686x4908c39981467106@mail.gmail.com>
Message-ID: <45062AD2.1090207@canterbury.ac.nz>

Guido van Rossum wrote:

> All sorts of things are different when reading stdin vs. opening a
> filename. e.g. stdin may be a pipe.

Which suggests that if anything is going to try
to guess the encoding, it would be better for it
to start reading from the actual stream you're
going to use and buffer the result, rather than
rely on being able to open it separately.
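
For instance (a sketch; the buffered-stream names are an assumption
borrowed from the io discussions, and peek() looks ahead without
consuming anything):

    import codecs, io

    def guessing_reader(buffered):
        # buffered: a binary stream with peek(), e.g. sys.stdin.buffer
        head = buffered.peek(4)[:4]
        if head.startswith(codecs.BOM_UTF8):
            enc = 'utf-8-sig'     # decode and drop the BOM
        elif head[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
            enc = 'utf-16'        # NB: a UTF-32-LE BOM also begins FF FE
        else:
            enc = 'utf-8'         # optimistic default
        return io.TextIOWrapper(buffered, encoding=enc)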

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 12 05:37:55 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 12 Sep 2006 15:37:55 +1200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <ca471dc20609101009x3ebe0482x5ceb786e748558b@mail.gmail.com>
References: <1cb725390609092029v6dae753ajf8bf85e4974bb4d0@mail.gmail.com>
	<4503FF8F.6070801@gmail.com>
	<ca471dc20609101009x3ebe0482x5ceb786e748558b@mail.gmail.com>
Message-ID: <45062B93.3080207@canterbury.ac.nz>

Guido van Rossum wrote:
> if possible I'd like the
> guessing function to have access to what was in the file before it was
> emptied by the "create" function, or what's at the start before
> appending to the end,

Which further suggests that the encoding-guesser
needs to be fairly intimately built into some
layer of the i/o stack, and not require calling
a separate function (although it could be provided
as such in case you want to use it that way).

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 12 06:18:37 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 12 Sep 2006 16:18:37 +1200
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
Message-ID: <4506351D.4040109@canterbury.ac.nz>

Michael Chermside wrote:

> The idea of a standard edu library though is a GREAT one. That would
> provide a standard place for things like raw_input() (with a better
> name) as well as lots of other "helper functions" useful to beginners
> and/or students -- and all it would cost is a single line of boilerplate
> at the top of each program ("from beginnerlib import *" or something
> like that).

I disagree for two reasons:

1) Even a single line of boilerplate is too much
when you're trying to pare things down to the
bare minimum for a beginner.

2) It teaches a bad habit right from the
beginning (i.e. using 'import *'). This is the
wrong foot to start a beginner off on.


-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 12 06:36:01 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 12 Sep 2006 16:36:01 +1200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <45045105.5040209@jmunch.dk>
References: <1d85506f0609071030o60a02ac5j739027d1825f0e7f@mail.gmail.com>
	<ca471dc20609080937u5a9cfdf9l5b6a02be0396a9f9@mail.gmail.com>
	<4501A1B1.5050707@gmail.com> <200609081403.18350.fdrake@acm.org>
	<ca471dc20609081106n4e542b1fobd8a5a76c1c9c838@mail.gmail.com>
	<1157740873.4979.10.camel@fsol>
	<fb6fbf560609081204t1f4f1a6fo817e533f306f4861@mail.gmail.com>
	<450254DB.3020502@gmail.com> <45045105.5040209@jmunch.dk>
Message-ID: <45063931.3050904@canterbury.ac.nz>

Anders J. Munch wrote:
> any file that supports seeking to the end will also support
> reporting the file size.  Thus
>   f.seek(f.length)
> should suffice,

Although the micro-optimisation circuit in my
brain complains that it will take 2 system
calls when it could be done with 1...

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From solipsis at pitrou.net  Tue Sep 12 08:37:58 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Tue, 12 Sep 2006 08:37:58 +0200
Subject: [Python-3000] text editors
In-Reply-To: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
Message-ID: <1158043078.4276.6.camel@fsol>

On Monday, 11 September 2006 at 18:16 -0700, Paul Prescod wrote:
> On Unix, VIM is also set up to auto-detect UTF-8 (using the BOM or
> full decoding attempt). According to Google, XEmacs also has some
> kind of UTF-8/BOM detector but I don't know the details. GNU Emacs:
> According to "Emacs wiki": "Auto-detection of UTF-8 is effectively
> disabled by default in GNU Emacs 21.3 and below."
> 
> So the situation on Unix is not as clear.

gedit has an ordered list of encodings to test for when it opens a file,
and it chooses the first encoding which succeeds in decoding the file.

The encoding list is stored as a gconf key named "auto_detected"
in /apps/gedit-2/preferences/encodings, and its default value is
[UTF-8, CURRENT, ISO-8859-15]
("CURRENT" being interpreted as the current locale).

I suppose the explicit fallback to iso-8859-15 is for the common case
where the user has a Western European language, his user locale is utf-8
and he has some non-Unicode files hanging around...
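
(A rough Python equivalent of that strategy, for illustration; the
candidate list mirrors the gconf default above, with "CURRENT" resolved
via the locale module:)

import locale

def gedit_style_decode(data, candidates=("utf-8", "CURRENT", "iso-8859-15")):
    for name in candidates:
        if name == "CURRENT":
            name = locale.getpreferredencoding()
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue   # try the next candidate
    raise UnicodeError("no candidate encoding matched")

Note that iso-8859-15 assigns a character to every byte, so in practice
the last candidate always succeeds.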

Regards

Antoine.



From ajm at flonidan.dk  Tue Sep 12 09:01:01 2006
From: ajm at flonidan.dk (Anders J. Munch)
Date: Tue, 12 Sep 2006 09:01:01 +0200
Subject: [Python-3000] iostack, second revision
Message-ID: <9B1795C95533CA46A83BA1EAD4B01030031F52@flonidanmail.flonidan.net>

Greg Ewing wrote:
> Anders J. Munch wrote:
> > any file that supports seeking to the end will also support
> > reporting the file size.  Thus
> >   f.seek(f.length)
> > should suffice,
> 
> Although the micro-optimisation circuit in my
> brain complains that it will take 2 system
> calls when it could be done with 1...

I don't expect file methods and system calls to map one to one, but
you're right, the first time the length is needed, that's an extra
system call.

My micro-optimisation circuitry blew a fuse when I discovered that
seek always implies flush.  You won't get good performance out of code
that does a lot of seeks, whatever you do.  Use my upcoming FileBytes
class :)

- Anders

From tony at PageDNA.com  Tue Sep 12 04:27:09 2006
From: tony at PageDNA.com (Tony Lownds)
Date: Mon, 11 Sep 2006 19:27:09 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <ca471dc20609111518o73e119b5p49a5b9c9c5e9a09b@mail.gmail.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<1158000166.4672.33.camel@fsol>
	<ca471dc20609111518o73e119b5p49a5b9c9c5e9a09b@mail.gmail.com>
Message-ID: <0CC6FE11-6DD0-4923-B417-21273679A15A@PageDNA.com>

>> IMHO, it would be better to label the module "scripting" rather than
>> "beginnerlib" (and why append "lib" at the end of module names
>> anyway? :-)).
>> It might even contain stuff such as encoding guessing.
>>
>>>>> from scripting import raw_input, autotextfile
>
> I'm not so keen on 'scripting' as the name either, but I'm sure we can
> come up with something. Perhaps easyio, simpleio or basicio? (Not to
> be confused with vbio. :-)
>

How about simpleui? This is a user interface routine.

> I'm also not completely against revising the decision on killing
> raw_input(). While input() must definitely go, raw_input() might
> survive under a new name. Too bad calling it input() would be too
> confusing from a Python 2.x POV, and I don't want to call it
> readline() because it strips the trailing newline and raises EOF on
> error. Unless the educators can live with having to use
> readline().strip() instead of raw_input()...?
>

Javascript provides prompt, confirm, alert. They are all very useful
as user interface routines and would be just as useful in Python.
Maybe raw_input could survive as "prompt".

Alternatively, here is a way to keep the "input" name with a useful
extension to the semantics of input(). Add a "converter" argument that
defaults to eval in Python 2.6. "eval" is deprecated as an argument in
2.7. In Python 3.0 the default gets changed to "str". The user's input
is passed through the converter function. Any exceptions from the
converter cause input() to prompt the user again. Code is below.

sys.stdin.readline() doesn't use the readline library, whereas the
current raw_input() and input() builtins do. That is a nice feature,
and I don't think it can even be emulated with the current readline
module.

-Tony Lownds

import sys
def input(prompt='', converter=eval):
   while 1:
     sys.stdout.write(prompt)
     sys.stdout.flush()
     # strip the trailing newline, like raw_input() does
     line = sys.stdin.readline().rstrip('\n\r')
     try:
       return converter(line)
     except (KeyboardInterrupt, SystemExit):
       # never swallow Ctrl-C or sys.exit()
       raise
     except Exception, e:
       # conversion failed: report the error and prompt again
       print str(e)

if __name__ == '__main__':
   print "Result: %s" % input("Enter string:", str)
   print "Result: %d" % input("Enter integer:", int)
   print "Result: %r" % input("Enter expression:")

Here's how it looks when run:

Enter string:12
Result: 12
Enter integer:1a
invalid literal for int(): 1a
Enter integer:
invalid literal for int():
Enter integer:12
Result: 12
Enter expression:a b c
invalid syntax (line 1)
Enter expression:abc
name 'abc' is not defined
Enter expression:'abc'
Result: 'abc'




From ncoghlan at gmail.com  Tue Sep 12 15:45:44 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Tue, 12 Sep 2006 23:45:44 +1000
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>
Message-ID: <4506BA08.7090905@gmail.com>

Brett Cannon wrote:
> On 9/11/06, *Michael Chermside* <mcherm at mcherm.com 
>     Personally, I think input() should never have existed and must go
>     no matter what.
> 
> 
> Agreed.  Teach the folks eval() quick if you want something like that.

The world would probably be a happier place if you taught them int() and 
float() instead, though :)

>     I think raw_input() is worth discussing -- I wouldn't
>     need it, but it's little more than a convenience function.
> 
> 
> Yeah, but when you are learning it's cool to take input easily.  I loved 
> raw_input() when I started out.

We could always rename raw_input() to input(). Just a thought. . .

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Tue Sep 12 15:47:15 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Tue, 12 Sep 2006 23:47:15 +1000
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <4506BA08.7090905@gmail.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>
	<4506BA08.7090905@gmail.com>
Message-ID: <4506BA63.7040201@gmail.com>

Nick Coghlan wrote:
> We could always rename raw_input() to input(). Just a thought. . .

D'oh. Guido already said he doesn't like that idea :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From rhettinger at ewtllc.com  Tue Sep 12 16:25:06 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Tue, 12 Sep 2006 07:25:06 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <4506BA63.7040201@gmail.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>	<4506BA08.7090905@gmail.com>
	<4506BA63.7040201@gmail.com>
Message-ID: <4506C342.6010202@ewtllc.com>

Nick Coghlan wrote:
>> We could always rename raw_input() to input(). Just a thought. . .
>
> D'oh. Guido already said he doesn't like that idea :)

FWIW, I think it is a good idea.  If there is a little 2.x vs 3.0
confusion, so be it.   The use of input() function is already somewhat
rare (both because of infrequent use cases and because of the stern
warnings about eval's security issues).  It is better to bite the bullet
and move on than it would be to avoid the most obvious name.

Raymond

From jcarlson at uci.edu  Tue Sep 12 18:05:50 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Tue, 12 Sep 2006 09:05:50 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <9B1795C95533CA46A83BA1EAD4B01030031F52@flonidanmail.flonidan.net>
References: <9B1795C95533CA46A83BA1EAD4B01030031F52@flonidanmail.flonidan.net>
Message-ID: <20060912085752.F915.JCARLSON@uci.edu>


"Anders J. Munch" <ajm at flonidan.dk> wrote:
> 
> Greg Ewing wrote:
> > Anders J. Munch wrote:
> > > any file that supports seeking to the end will also support
> > > reporting the file size.  Thus
> > >   f.seek(f.length)
> > > should suffice,
> > 
> > Although the micro-optimisation circuit in my
> > brain complains that it will take 2 system
> > calls when it could be done with 1...
> 
> I don't expect file methods and systems calls to map one to one, but
> you're right, the first time the length is needed, that's an extra
> system call.

Every time the length is needed, a system call is required (you can have
multiple writers of the same file)...

>>> import os
>>> a = open('test.txt', 'a')
>>> b = open('test.txt', 'a')
>>> a.write('hello')
>>> b.write('whee!!')
>>> a.flush()
>>> os.fstat(a.fileno()).st_size
5L
>>> b.flush()
>>> os.fstat(b.fileno()).st_size
11L
>>>


> My micro-optimisation circuitry blew a fuse when I discovered that
> seek always implies flush.  You won't get good performance out of code
> that does a lot of seeks, whatever you do.  Use my upcoming FileBytes
> class :)

Flushing during seek is important.  By not flushing during seek in your
FileBytes object, you are unnecessarily delaying writes, which could
cause file corruption.

 - Josiah


From tjd at sfu.ca  Tue Sep 12 18:51:18 2006
From: tjd at sfu.ca (Toby Donaldson)
Date: Tue, 12 Sep 2006 09:51:18 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <4506C342.6010202@ewtllc.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>
	<4506BA08.7090905@gmail.com> <4506BA63.7040201@gmail.com>
	<4506C342.6010202@ewtllc.com>
Message-ID: <a2f565170609120951r79da32d6j694de125452c508@mail.gmail.com>

On 9/12/06, Raymond Hettinger <rhettinger at ewtllc.com> wrote:
>
>  We could always rename raw_input() to input(). Just a thought. . .
>
>  D'oh. Guido already said he doesn't like that idea :)
>
>
>
>  FWIW, I think it is a good idea.  If there is a little 2.x vs 3.0
> confusion, so be it.   The use of input() function is already somewhat rare
> (both because of infrequent use cases and because of the stern warnings
> about eval's security issues).  It is better to bite the bullet and move on
> than it would be to avoid the most obvious name.

I agree ... "input" is perhaps the best name from a beginner's point of
view, and it is only a minor inconvenience for experienced
programmers.

Toby
-- 
Dr. Toby Donaldson
School of Computing Science
Simon Fraser University (Surrey)

From tjd at sfu.ca  Tue Sep 12 19:11:29 2006
From: tjd at sfu.ca (Toby Donaldson)
Date: Tue, 12 Sep 2006 10:11:29 -0700
Subject: [Python-3000] educational aspects of Python 3000
Message-ID: <a2f565170609121011l28a69241n97817905decfb0@mail.gmail.com>

> How about calling it 'ask'?
>
>  >>> s = ask( "How are you today?" )
> --> Fine
>  >>> s
> "Fine"
>
> And as far as the name of a library goes how about "quickstart"? Other
> possibilities are: quickstudy, kickstart, simplestart, etc.
>
> "With the Python quickstart module, programming is as easy as
> one...two...five!"

:-)

Actually, some educators (not me so much, but I see where they are
coming from) have negative reactions to library names like "teach" or
"edu" because they feel it sends a message to students that they are
not learning "real" Python.

A positive-sounding name like "quickstart" would avoid this problem.

Toby
-- 
Dr. Toby Donaldson
School of Computing Science
Simon Fraser University (Surrey)

From talin at acm.org  Tue Sep 12 19:35:39 2006
From: talin at acm.org (Talin)
Date: Tue, 12 Sep 2006 10:35:39 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <a2f565170609121011l28a69241n97817905decfb0@mail.gmail.com>
References: <a2f565170609121011l28a69241n97817905decfb0@mail.gmail.com>
Message-ID: <4506EFEB.6040001@acm.org>

Toby Donaldson wrote:
>> How about calling it 'ask'?
>>
>>  >>> s = ask( "How are you today?" )
>> --> Fine
>>  >>> s
>> "Fine"
>>
>> And as far as the name of a library goes how about "quickstart"? Other
>> possibilities are: quickstudy, kickstart, simplestart, etc.
>>
>> "With the Python quickstart module, programming is as easy as
>> one...two...five!"
> 
> :-)
> 
> Actually, some educators (not me so much, but I see where they are
> coming from) have negative reactions to library names like "teach" or
> "edu" because they feel it sends a message to students that they are
> not learning "real" Python.
> 
> A positive-sounding name like "quickstart" would avoid this problem.

That was exactly my thinking.

-- Talin

From nnorwitz at gmail.com  Tue Sep 12 19:53:13 2006
From: nnorwitz at gmail.com (Neal Norwitz)
Date: Tue, 12 Sep 2006 10:53:13 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <4506C342.6010202@ewtllc.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>
	<4506BA08.7090905@gmail.com> <4506BA63.7040201@gmail.com>
	<4506C342.6010202@ewtllc.com>
Message-ID: <ee2a432c0609121053p33c98f7dm419314ec3a6b2fe2@mail.gmail.com>

On 9/12/06, Raymond Hettinger <rhettinger at ewtllc.com> wrote:
>
>  We could always rename raw_input() to input(). Just a thought. . .
>
>  D'oh. Guido already said he doesn't like that idea :)
>
>  FWIW, I think it is a good idea.  If there is a little 2.x vs 3.0
> confusion, so be it.   The use of input() function is already somewhat rare
> (both because of infrequent use cases and because of the stern warnings
> about eval's security issues).  It is better to bite the bullet and move on
> than it would be to avoid the most obvious name.

I agree.  Plus we are already doing something similar for {}.keys()
etc. by changing them in a somewhat subtle way.  I also recall
something weird when I ripped out input wrt readline or something.  I
don't recall if I checked in the removal of {raw_,}input or not.

This is also something easy to look for and flag.  pychecker already
catches uses of input and warns about it.

n

From rrr at ronadam.com  Tue Sep 12 23:03:30 2006
From: rrr at ronadam.com (Ron Adam)
Date: Tue, 12 Sep 2006 16:03:30 -0500
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <4506C342.6010202@ewtllc.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>	<4506BA08.7090905@gmail.com>	<4506BA63.7040201@gmail.com>
	<4506C342.6010202@ewtllc.com>
Message-ID: <ee77gu$3sa$1@sea.gmane.org>

Raymond Hettinger wrote:
> 
>>> We could always rename raw_input() to input(). Just a thought. . .
>>>     
>>
>> D'oh. Guido already said he doesn't like that idea :)
>>
>>   
> 
> FWIW, I think it is a good idea.  If there is a little 2.x vs 3.0 
> confusion, so be it.   The use of input() function is already somewhat 
> rare (both because of infrequent use cases and because of the stern 
> warnings about eval's security issues).  It is better to bite the bullet 
> and move on than it would be to avoid the most obvious name.
> 
> Raymond

Maybe "input" can be depreciated in 2.x with a messages to use eval(raw_input()) 
instead.  That would limit some of the confusion.




From rhettinger at ewtllc.com  Tue Sep 12 23:58:19 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Tue, 12 Sep 2006 14:58:19 -0700
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <ee77gu$3sa$1@sea.gmane.org>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>	<4506BA08.7090905@gmail.com>	<4506BA63.7040201@gmail.com>	<4506C342.6010202@ewtllc.com>
	<ee77gu$3sa$1@sea.gmane.org>
Message-ID: <45072D7B.4020202@ewtllc.com>

Ron Adam wrote:

>Maybe "input" can be depreciated in 2.x with a messages to use eval(raw_input()) 
>instead.  That would limit some of the confusion.
>
>  
>

Let me take this opportunity to articulate a principle that I hope this 
group will adopt, "Thou shalt not muck-up Py2.x in the name of Py3k."

Given that Py3k will not be backwards compatible in many ways, we may 
expect that tons of code will remain in the 2.x world and it behooves us 
not to burden that massive codebase with Py3k oriented deprecations, 
warnings, etc.  It's okay to backport compatible feature additions and I 
expect that a number of people will author third-party transition tools, 
but let's not gum-up the current, wildly successful strain of Python.  
Expect that 2.x will continue to live side-by-side with Py3k for a long 
time.  It is a bit premature to read the will and auction off the estate ;-)

Any ideas for Py3k that are part of the natural evolution of the 2.x 
series can of course be done in parallel, but each 2.x proposal needs to 
be evaluated on its own merits.  IOW, "limiting 2.x vs 3k confusion" is 
NOT a sufficient reason to change 2.x.


Raymond





From bjourne at gmail.com  Wed Sep 13 00:56:27 2006
From: bjourne at gmail.com (=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=)
Date: Wed, 13 Sep 2006 00:56:27 +0200
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <4506351D.4040109@canterbury.ac.nz>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<4506351D.4040109@canterbury.ac.nz>
Message-ID: <740c3aec0609121556x12586796wa0888b8284eb94f@mail.gmail.com>

> > The idea of a standard edu library though is a GREAT one. That would
> > provide a standard place for things like raw_input() (with a better
> > name) as well as lots of other "helper functions" useful to beginners
> > and/or students -- and all it would cost is a single line of boilerplate
> > at the top of each program ("from beginnerlib import *" or something
> > like that).
>
> I disagree for two reasons:
>
> 1) Even a single line of boilerplate is too much
> when you're trying to pare things down to the
> bare minimum for a beginner.
>
> 2) It teaches a bad habit right from the
> beginning (i.e. using 'import *'). This is the
> wrong foot to start a beginner off on.

I agree. For an absolute newbie, Python's import semantics are way, WAY
down the road, long after variables, numbers, strings, comments,
control statements, functions etc. A third reason is that if these
functions are packaged in a beginnerlib module, then you would have to
type "from beginnerlib import *" each and every time you want to use
raw_input() from the Python console.

-- 
mvh Björn

From steven.bethard at gmail.com  Wed Sep 13 04:47:02 2006
From: steven.bethard at gmail.com (Steven Bethard)
Date: Tue, 12 Sep 2006 20:47:02 -0600
Subject: [Python-3000] educational aspects of Python 3000
In-Reply-To: <45072D7B.4020202@ewtllc.com>
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com>
	<bbaeab100609111126x1d57997dhe6e99c7810e94fdb@mail.gmail.com>
	<4506BA08.7090905@gmail.com> <4506BA63.7040201@gmail.com>
	<4506C342.6010202@ewtllc.com> <ee77gu$3sa$1@sea.gmane.org>
	<45072D7B.4020202@ewtllc.com>
Message-ID: <d11dcfba0609121947g752b4f03wab49b7a09bddc8fd@mail.gmail.com>

On 9/12/06, Raymond Hettinger <rhettinger at ewtllc.com> wrote:
> Ron Adam wrote:
> >Maybe "input" can be depreciated in 2.x with a messages to use eval(raw_input())
> >instead.  That would limit some of the confusion.
>
> Let me take this opportunity to articulate a principle that I hope this
> group will adopt, "Thou shalt not muck-up Py2.x in the name of Py3k."

I agree 100% with this principle.  But "input" could definitely get a
warning when the Python 2.x --warn-me-about-python-3-incompatibilities
switch is given.  Guido's already suggested that, for example, using
the result of dict.items() for anything other than iteration should
issue such a warning.

STeVe
-- 
I'm not *in*-sane. Indeed, I am so far *out* of sane that you appear a
tiny blip on the distant coast of sanity.
        --- Bucky Katt, Get Fuzzy

From martin at v.loewis.de  Wed Sep 13 06:56:33 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:56:33 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com>
References: <ed8pd9$ch$1@sea.gmane.org>
	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<1cb725390609051317p9b75fb4l1e3068b0f42f121f@mail.gmail.com>
Message-ID: <45078F81.4010506@v.loewis.de>

Paul Prescod wrote:
> I haven't created locale-relevant content in a generic text editor in a
> very, very long time.

You are an atypical user, then. I use plain text files all the time, and
I know other people do as well.

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:38:30 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:38:30 +0200
Subject: [Python-3000] string C API
In-Reply-To: <ed968c$d03$1@sea.gmane.org>
References: <ed968c$d03$1@sea.gmane.org>
Message-ID: <45078B46.90408@v.loewis.de>

Fredrik Lundh wrote:
> just noticed that PEP 3100 says that PyString_AsEncodedString and
> PyString_AsDecodedString is to be removed, but it doesn't mention
> any other PyString (or PyUnicode) functions.
> 
> how large changes can we make here, really ?

All API that refers to the internal representation should be
changed or removed; in theory, that could be all API that has
char* arguments.

For example, PyString_From{String[AndSize]|Format} would either:
- have to grow an encoding argument
- assume a default encoding (either ASCII or UTF-8)
- change its signature to operate on Py_UNICODE* (although
  we don't have literals for these) or
- be removed

Likewise, PyString_AsString either goes away or changes its
return type.

String APIs that operate on PyObject* likely can stay as-is.

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:10:38 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:10:38 +0200
Subject: [Python-3000] Character Set Indepencence
In-Reply-To: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com>
References: <1cb725390609010711p37586956q1ec570a2d8c4196d@mail.gmail.com>
Message-ID: <450784BE.7050802@v.loewis.de>

Paul Prescod wrote:
> I think that the gist of it is that Unicode will be "just one character
> set" supported by Ruby. This idea has been kicked around for Python
> before but you quickly run into questions about how you compare
> character strings from multiple character sets, to say nothing of the
> complexity of a character encoding and character set agnostic regular
> expression engine.

As Guido says, the arguments for "CSI (character set independence)"
are hardly convincing. Yes, there are cases where Unicode doesn't
"round-trip", but they are so obscure that they (IMO) can be ignored
safely.

There are two problems in this respect with Unicode:
- in some cases, a character set may contain characters that are
  not included in Unicode. This was a serious problem for Chinese
  for quite some time, but I believe this is now fixed, with the
  plane-2 additions. If just round-tripping is the goal, then it
  is always possible for a codec to map characters to the
  private-use areas of Unicode. This is not optimal, since a
  different codec may give a different meaning to the same PUA
  characters, but there should rarely be a need to use them in
  the first place.

- in some cases, the input encoding has multiple representations
  for what becomes the same character in Unicode. For example,
  in ISO-2022-jp, there are three ways to encode the latin
  letters (either in ASCII, or in the romaji part of
  either JIS X 0208-1978 or JIS X 0208-1983). You can switch
  between these in a single string; if you go back and forth
  through Unicode, you get a normalized version that
  .encode gives you. While I have seen people bringing it
  up now and then, I don't recall anybody claiming that this
  is a real, practical problem.

There is a third problem that people often associate with
Unicode: due to the Han unification, you don't know whether
a certain Han character originates from Chinese, Japanese,
or Korean. This is a problem when rendering Unicode: you
don't know what glyphs to use (as you should use different
glyphs depending on the natural language). With CSI, you
can use a "language-aware encoding": you use a Japanese
encoding for Japanese text, and so on, then use the encoding
to determine what the language is.

For Unicode, there are several ways to deal with it:
- you could carry language information along with the
  original text. This is what is commonly done in the
  web: you put language information into the HTML,
  and then use that to render the text correctly.
- you could embed language information into the Unicode
  string, using the plane-14 tag characters (see the
  sketch after this list). This
  should work fairly nicely, since you only need
  a single piece of information, but has some drawbacks:
  * you need four-byte Unicode, or surrogates
  * if you slice such a string, the slices won't
    carry the language tag
  * applications today typically don't know how to
    deal with tag characters
- you could guess the language from the content, based
  on the frequency of characters (e.g. presence
  of katakana/hiragana would indicate that it is
  Japanese). As with all guessing, there are
  cases where it fails. I believe that web browsers
  commonly apply that approach, anyway.
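
(The sketch of the plane-14 tagging mentioned above: the language tag
is U+E0001 followed by tag letters at U+E0020 plus each ASCII letter's
code. It assumes a wide (UCS-4) Python build, since unichr() cannot
reach plane 14 on a narrow build:)

def language_tag(code):
    # U+E0001 LANGUAGE TAG, then one tag letter per ASCII letter
    return u'\U000E0001' + u''.join(unichr(0xE0020 + ord(c)) for c in code)

tagged = language_tag('ja') + u'\u6f22\u5b57'   # Han text tagged as Japanese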

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:44:54 2006
From: martin at v.loewis.de (=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:44:54 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <ed8pd9$ch$1@sea.gmane.org>
References: <ed8pd9$ch$1@sea.gmane.org>
Message-ID: <45078CC6.7050505@v.loewis.de>

Fredrik Lundh wrote:
> today's Python supports "locale aware" 8-bit strings; e.g.
> 
>     >>> import locale
>     >>> "åäö".isalpha()
>     False
>     >>> locale.setlocale(locale.LC_ALL, "sv_SE")
>     'sv_SE'
>     >>> "åäö".isalpha()
>     True
> 
> to what extent should this be supported by Python 3000 ?

I would like to see locale-aware operations, but with an
explicit locale, e.g.

import locale
l = locale.load(locale.LC_ALL, "sv_SE")
print l.isalpha("åäö")

(i.e. character properties become locale methods,
not string methods).

To implement that, we would have to incorporate ICU,
which would be a tough decision to make (or have our own
implementation based on the tables that ICU uses).

Alternatively, we could try to get such locale objects
from system APIs where available (e.g. <xlocale.h>
in glibc), and not provide them on systems that don't
have locale objects in their APIs.

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:51:30 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:51:30 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FDBDFF.7090505@sweetapp.com>
References: <ed8pd9$ch$1@sea.gmane.org>	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com>
Message-ID: <45078E52.9080402@v.loewis.de>

Brian Quinlan wrote:
> As a user, I don't have any expectations regarding non-ASCII text files.
> 
> I'm using a US-English version of Windows XP (very common) and I haven't 
> changed the default encoding (very common). Python claims that my system 
> encoding is CP437 (from sys.stdin/stdout.encoding).

You are misinterpreting the data you see. Python makes no claims about
your system encoding in sys.stdout.encoding. Instead, it makes a claim
about your terminal's encoding, and that is indeed CP437 (just do
"type foo.txt" with a document that contains non-ASCII characters,
and watch the characters in the terminal look different from the
ones in notepad).

It is an unfortunate fact that Windows has *two* system encodings: one
used for "Windows", and one used for the "OEM". The terminal uses the
OEM code page (by default, unless you run chcp.exe).
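
(The two can be queried directly with ctypes on Windows; a quick,
Windows-only sketch:)

import ctypes
kernel32 = ctypes.windll.kernel32
print "ANSI ('Windows') code page:", kernel32.GetACP()    # e.g. 1252
print "OEM (terminal) code page:  ", kernel32.GetOEMCP()  # e.g. 437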

> I can assure you
> that most of the documents that I work with are not in CP437 - they are 
> a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that 
> this is true of many Windows XP (US-English) users. So, for me and users 
> like me, Python is going to silently misinterpret my data.

No. It will use a different API to determine the system encoding, and
it will guess correctly.

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:20:12 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:20:12 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <1d85506f0609021529o3a83dccbod0a7a643d39da696@mail.gmail.com>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>	<44F9E844.2020603@acm.org>
	<1d85506f0609021529o3a83dccbod0a7a643d39da696@mail.gmail.com>
Message-ID: <450786FC.2020808@v.loewis.de>

tomer filiba wrote:
> # read 3 UTF8 *characters*
> f.read(3)
> 
> # this will seek by AT LEAST 7 *bytes*, until resynched
> f.substream.seekby(7)
> 
> # we can resume reading of UTF8 *characters*
> f.read(3)
> 
> heck, i even like this idea :)

Notice that resyncing is a really tricky operation, and
should not be expected to work for all encodings. For
example, for the iso-2022 encodings, you have to know
what character set you are "in", and you have to read
forward/backward until you find a character-code switching
escape sequence.

There is an RFC-imposed requirement that each line
of input is "neutral" wrt. character set switching,
so you can typically synchronize at a line break. Still,
this could require skipping an arbitrary amount of text.
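
(Python's own codec makes the switching visible; note the ESC $ B and
ESC ( B escapes around the JIS-encoded bytes:)

>>> u'abc\u3042'.encode('iso2022_jp')
'abc\x1b$B$"\x1b(B'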

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:53:52 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:53:52 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <44FE1A65.7020900@blueyonder.co.uk>
References: <ed8pd9$ch$1@sea.gmane.org>	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>	<44FDBDFF.7090505@sweetapp.com>	<ca471dc20609051213g7addc8f8y81ac22f0dae10073@mail.gmail.com>
	<44FE1A65.7020900@blueyonder.co.uk>
Message-ID: <45078EE0.8090906@v.loewis.de>

David Hopwood wrote:
> Cp437 is almost certainly *not* the encoding set by the OS; Python
> has got it wrong.

Just to repeat myself: Python is *not* wrong, the terminal *indeed*
uses CP 437.

> If Brian is using an English-language variant of
> Windows XP and has not changed the defaults, the system ("ANSI")
> encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1
> if C1 control characters are not used).

Yes, and the OEM encoding will be CP 437. It is common to interpret
CP_ACP as the system encoding, yet Windows has two of them, and Python
knows very well which one to use in which place.

Regards,
Martin

From martin at v.loewis.de  Wed Sep 13 06:22:16 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 06:22:16 +0200
Subject: [Python-3000] encoding hell
In-Reply-To: <ede6m9$c9g$1@sea.gmane.org>
References: <1d85506f0609020853y6cba1b85h552157c2b2614ca2@mail.gmail.com>	<44FA0E59.9010302@canterbury.ac.nz>
	<ede6m9$c9g$1@sea.gmane.org>
Message-ID: <45078778.6000407@v.loewis.de>

Fredrik Lundh wrote:
>> The best you could do would be to return some kind
>> of opaque object from tell() that could be passed
>> back to seek().
> 
> that's how seek/tell works on text files in today's Python, of course. 
> if you're writing portable code, you can only seek to the beginning or 
> end of the file, or to a position returned to you by tell.

The problem is that for character-oriented streams, that position
should also incorporate the "shift state" of the codec. To support
that, the codec API would need to grow a way to export and import
its state into such "tell objects".
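
(A sketch of what such a "tell object" might look like, assuming an
incremental decoder that can export and restore its state;
getstate()/setstate() follow the codecs.IncrementalDecoder interface,
the rest is hypothetical:)

import codecs

class TextPos(object):
    # Opaque position: callers can only hand it back to seek().
    def __init__(self, byte_pos, decoder_state):
        self.byte_pos = byte_pos
        self.decoder_state = decoder_state

def text_tell(raw, decoder):
    # Capture the byte offset *and* the codec's shift state together.
    return TextPos(raw.tell(), decoder.getstate())

def text_seek(raw, decoder, pos):
    raw.seek(pos.byte_pos)
    decoder.setstate(pos.decoder_state)

# e.g. decoder = codecs.getincrementaldecoder('iso2022_jp')()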

Regards,
Martin

From paul at prescod.net  Wed Sep 13 08:10:29 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 12 Sep 2006 23:10:29 -0700
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <45078E52.9080402@v.loewis.de>
References: <ed8pd9$ch$1@sea.gmane.org> <44F8FEED.9000600@gmail.com>
	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>
	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>
	<44FC4B5B.9010508@blueyonder.co.uk>
	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>
	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>
	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com> <45078E52.9080402@v.loewis.de>
Message-ID: <1cb725390609122310k44b99f9eqb12aec7c5fadf886@mail.gmail.com>

On 9/12/06, "Martin v. L?wis" <martin at v.loewis.de> wrote:
>
>
> > I can assure you
> > that most of the documents that I work with are not in CP437 - they are
> > a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that
> > this is true of many Windows XP (US-English) users. So, for me and users
> > like me, Python is going to silently misinterpret my data.
>
> No. It will use a different API to determine the system encoding, and
> it will guess correctly.


If Python reports "cp1252" as I expect it to, then it has not "guessed
correctly" for Brian's documents as described above. The mistake will be
harmless for the ASCII files and often for the ISO8859-1 files, but would be
dangerous for the UTF-8 ones.

 Paul Prescod

From brian at sweetapp.com  Wed Sep 13 10:06:41 2006
From: brian at sweetapp.com (Brian Quinlan)
Date: Wed, 13 Sep 2006 10:06:41 +0200
Subject: [Python-3000] locale-aware strings ?
In-Reply-To: <45078E52.9080402@v.loewis.de>
References: <ed8pd9$ch$1@sea.gmane.org>	<44F8FEED.9000600@gmail.com>	<fb6fbf560609031622n17c9a126h132ad88bf9e474e8@mail.gmail.com>	<ca471dc20609031911p39673696g34668f107bab942f@mail.gmail.com>	<44FC4B5B.9010508@blueyonder.co.uk>	<ca471dc20609041432m7cbe6db8wbb68a4c5fd4401a4@mail.gmail.com>	<1cb725390609050908m6150c1cas533a98f6a14cbcd0@mail.gmail.com>	<ca471dc20609050952h662570efrd3bd509e19d0849e@mail.gmail.com>
	<44FDBDFF.7090505@sweetapp.com> <45078E52.9080402@v.loewis.de>
Message-ID: <4507BC11.8040901@sweetapp.com>

Martin v. Löwis wrote:
>> I can assure you
>> that most of the documents that I work with are not in CP437 - they are 
>> a combination of ASCII, ISO8859-1, and UTF-8. I would also guess that 
>> this is true of many Windows XP (US-English) users. So, for me and users 
>> like me, Python is going to silently misinterpret my data.
> 
> No. It will use a different API to determine the system encoding, and
> it will guess correctly.

You are addressing a completely different issue. I am saying that Python 
is going to silently misinterpret my *data* and you are saying that it 
is going to correctly determine the *system encoding*.

As a user, I don't directly care if Python guesses my system encoding 
correctly or not, but I do care if it tries to interpret my UTF-8 
documents as Windows-1252 (which will succeed) and I end up 
transmitting/storing/displaying incorrect data.

Cheers,
Brian


From ajm at flonidan.dk  Wed Sep 13 10:27:31 2006
From: ajm at flonidan.dk (Anders J. Munch)
Date: Wed, 13 Sep 2006 10:27:31 +0200
Subject: [Python-3000] iostack, second revision
Message-ID: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net>

Josiah Carlson wrote:
> "Anders J. Munch" <ajm at flonidan.dk> wrote:
> > I don't expect file methods and system calls to map one to one, but
> > you're right, the first time the length is needed, that's an extra
> > system call.
> 
> Every time the length is needed, a system call is required 
> (you can have
> multiple writers of the same file)...

Point taken.  It's very rarely a good idea to do so, but the
possibility of multiple writers shouldn't be ignored.  Still there is
no real performance issue.  If anything, replacing
f.seek(0,2);f.tell() with f.length in various places might save a few
system calls.

> 
> Flushing during seek is important.  By not flushing during 
> seek in your
> FileBytes object, you are unnecessarily delaying writes, which could
> cause file corruption.

That's what the flush method is for.  The real reason seek implies
flush is to save the library author the bother of getting the
interactions between input and output buffering right.
Anyway, FileBytes has no seek and no concept of current file position,
so I really don't know what you're talking about :)

- Anders

From john at yates-sheets.org  Wed Sep 13 15:24:00 2006
From: john at yates-sheets.org (John S. Yates, Jr.)
Date: Wed, 13 Sep 2006 09:24:00 -0400
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
Message-ID: <u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>

On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote:

> UTF-8 with BOM is the Microsoft preferred format.

I believe this is a gloss.  Microsoft uses UTF-16.  Because
the basic character unit is larger than one byte it is crucial
for interoperability to prefix a string of UTF-16 text with an
indication of the order of bytes in each two byte unit.  This
is the role of the BOM.  The BOM is not part of the text.  It
is a wrapper or envelope.

It is a mistake on Microsoft's part to fail to strip the BOM
during conversion to UTF-8.  There is no MEANINGFUL definition
of BOM in a UTF-8 string.  But instead of stripping the wrapper
and converting only the text payload Microsoft lazily treats
both the wrapper and its payload as text.

You can see the logical fallacy if you imagine emitting UTF-16
text in an environment of one byte sex, reducing that text to
UTF-8, carrying it to an environment of the other byte sex and
raising it back to UTF-16.  The Unicode.org assumption is that
on generation one organizes the bytes of UTF-16 or UTF-32 units
according to what is most convenient for a given environment.
One prefixes a BOM to text objects to be persisted or passed
to differing byte-sex environments.  Such an object is not a
string but a means of inter-operation.

If the BOMs are not stripped during reduction to UTF-8 and are
reconstituted during raising to UTF-16 or UTF-32 then raising
must honor the BOM and the Unicode.org efficiency objective is
subverted.

You can take this further and imagine concatenating two UTF-8
strings, one originally UTF-16 generated in a little-endian
environment, the other originally UTF-16 generated in a big-
endian environment.  If the BOMs are not pre-stripped then
during raising of the concatenated result to UTF-16 you will
get an object with embedded BOMs.  This is not meaningful.
What does it mean within a UTF-16 string to encounter a BOM
that contradicts the wrapper/envelope?  Does this mean that
any correct UTF-16 utility must cope with a hybrid object whose
byte order potentially changes mid-stride?

/john, who has written a database loader that has to contend
with (and clearly diagnoses) BOM in UTF-8 strings.





From jimjjewett at gmail.com  Wed Sep 13 15:34:47 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 13 Sep 2006 09:34:47 -0400
Subject: [Python-3000] string C API
In-Reply-To: <45078B46.90408@v.loewis.de>
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>
Message-ID: <fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>

On 9/13/06, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> Fredrik Lundh schrieb:
> > just noticed that PEP 3100 says that PyString_AsEncodedString and
> > PyString_AsDecodedString is to be removed, but it doesn't mention
> > any other PyString (or PyUnicode) functions.

> > how large changes can we make here, really ?

> All API that refers to the internal representation should be
> changed or removed; in theory, that could be all API that has
> char* arguments.

This is sufficient to allow polymorphic strings -- including strings
whose data is implemented as a view into some other object.

> For example, PyString_From{String[AndSize]|Format} would either:
> - have to grow an encoding argument
> - assume a default encoding (either ASCII or UTF-8)
> - change its signature to operate on Py_UNICODE* (although
>   we don't have literals for these) or
> - be removed

Should encoding be an attribute of the string?

If so, should recoding require the creation of a new string (in the
new encoding)?

-jJ

From qrczak at knm.org.pl  Wed Sep 13 15:37:05 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 13 Sep 2006 15:37:05 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com> (John S. Yates,
	Jr.'s message of "Wed, 13 Sep 2006 09:24:00 -0400")
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
Message-ID: <87u03bq45a.fsf@qrnik.zagroda>

"John S. Yates, Jr." <john at yates-sheets.org> writes:

> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.  There is no MEANINGFUL definition
> of BOM in a UTF-8 string.  But instead of stripping the wrapper
> and converting only the text payload Microsoft lazily treats
> both the wrapper and its payload as text.

The Unicode standard is at fault too.

It specifies UTF-16 and UTF-32 in variants:

- UTF-{16,32} with an optional BOM (defaulting to big endian if the
  BOM is not present), where the BOM is mandatory if the first
  character of the contents is U+FEFF (otherwise it would be mistaken
  for a BOM).

- UTF-{16,32}{LE,BE} with a fixed endianness and without a BOM;
  a U+FEFF in UTF-16BE must not be interpreted as a BOM, it's always
  a part of the text.

The problem is that it's not clear in the case of UTF-8. Formally it
doesn't have a BOM, but the standard includes some ambiguous wording to
the effect that various software uses a UTF-8 BOM and that its presence
should not affect the interpretation. It should clearly distinguish two
interpretations of UTF-8: one without the concept of a BOM, and one
which permits the BOM (and in fact makes it mandatory if the stream
begins with U+FEFF).
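
(Python already models both readings: the plain utf-8 codec treats a
leading U+FEFF as text, while utf-8-sig treats it as a signature:)

>>> '\xef\xbb\xbfabc'.decode('utf-8')
u'\ufeffabc'
>>> '\xef\xbb\xbfabc'.decode('utf-8-sig')
u'abc'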

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From martin at v.loewis.de  Wed Sep 13 17:15:47 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 17:15:47 +0200
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>
Message-ID: <450820A3.4000302@v.loewis.de>

Jim Jewett wrote:
>> For example, PyString_From{String[AndSize]|Format} would either:
>> - have to grow an encoding argument
>> - assume a default encoding (either ASCII or UTF-8)
>> - change its signature to operate on Py_UNICODE* (although
>>   we don't have literals for these) or
>> - be removed
> 
> Should encoding be an attribute of the string?

No. A Python string is a sequence of Unicode characters.
Even if it was created by converting from some other encoding,
that original encoding gets lost when doing the conversion
(just like integers don't remember which base they were originally
represented in).

Regards,
Martin

From jcarlson at uci.edu  Wed Sep 13 18:21:52 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 13 Sep 2006 09:21:52 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net>
References: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net>
Message-ID: <20060913084256.F930.JCARLSON@uci.edu>


"Anders J. Munch" <ajm at flonidan.dk> wrote:
> Josiah Carlson wrote:
> > "Anders J. Munch" <ajm at flonidan.dk> wrote:
> > > I don't expect file methods and system calls to map one to one, but
> > > you're right, the first time the length is needed, that's an extra
> > > system call.
> > 
> > Every time the length is needed, a system call is required 
> > (you can have
> > multiple writers of the same file)...
> 
> Point taken.  It's very rarely a good idea to do so, but the
> possibility of multiple writers shouldn't be ignored.  Still there is
> no real performance issue.  If anything, replacing
> f.seek(0,2);f.tell() with f.length in various places might save a few
> system calls.

Any sane person uses os.stat(f.name) or os.fstat(f.fileno()), unless
they want to seek to the end of the file for later writing or expected
reading of data yet-to-be-written.  Interesting that both of these cases
basically read and write to the same file at the same time (perhaps even
in the same process), something you yourself said, "In all my
programming days I don't believe I written to and read from the same
file handle even once. Use cases exist, like if you're implementing a
DBMS..."


> > Flushing during seek is important.  By not flushing during 
> > seek in your
> > FileBytes object, you are unnecessarily delaying writes, which could
> > cause file corruption.
> 
> That's what the flush method is for.  The real reason seek implies
> flush is to save the library author the bother of getting the
> interactions between input and output buffering right.
> Anyway, FileBytes has no seek and no concept of current file position,
> so I really don't know what you're talking about :)

I was talking about your earlier statement, which I quoted in my earlier
reply to you:

> My micro-optimisation circuitry blew a fuse when I discovered that
> seek always implies flush.  You won't get good performance out of code
> that does a lot of seeks, whatever you do.  Use my upcoming FileBytes
> class :)

And with the context of a previous message from you:

> FileBytes would support the sequence protocol, mimicking bytes objects.
> It would support random-access read and write using __getitem__ and
> __setitem__, allowing slice assignment for slices of equal size.  And
> there would be append() to extend the file, and partial __delitem__
> support for truncating.

While it doesn't have the methods seek or tell, the underlying
implementation needs to use seek and tell (or a memory-mapped file, mmap). 
You were also talking about buffering writes to reduce the overhead of
the underlying seeks and tells because of apparent "optimizations" you
wanted to make. Here is a data integrity optimization you can make for
me: flush when accessing the file non-sequentially; any other behavior
could corrupt the data of users who have been relying on "seek implies
flush".


I would also mention that your FileBytes class is essentially a fake
memory-mapped file, and while I also have implemented an equivalent
class (for low-memory testing purposes in a DBMS-like situation), I find
using an mmap to be far faster and generally more reliable (and
usable with buffer()) than my FileBytes equivalent. Never mind that the
vast majority of users don't want a sequence interface to a file, they
want a stream interface, which is why you don't see many FileBytes-like
objects out in the wild, or really anyone suggesting such a wrapper
object be in the standard library.

With that said, I'm not sure your FileBytes object is really necessary
or desired for the future io library.  If people want that kind of an
interface, they can use mmap (and push for the various mmap bugs/feature
requests to be fixed), otherwise they should be using readable /
writable / both streams, something that Tomer has been working towards.
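
(For comparison, a minimal sketch of the mmap route on an ordinary
writable file; 'test.db' is a placeholder, and resizing and locking are
glossed over:)

import mmap, os

f = open('test.db', 'r+b')
m = mmap.mmap(f.fileno(), os.fstat(f.fileno()).st_size)
m[0:4] = 'hdr!'     # random-access write, as if it were a mutable bytes
data = m[4:16]      # random-access read, no seek/tell in sight
m.flush()           # push the changes back to the file
m.close()
f.close()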


 - Josiah


From jcarlson at uci.edu  Wed Sep 13 18:41:01 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 13 Sep 2006 09:41:01 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
References: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
Message-ID: <20060913092509.F933.JCARLSON@uci.edu>


"John S. Yates, Jr." <john at yates-sheets.org> wrote:
> 
> On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote:
> 
> > UTF-8 with BOM is the Microsoft preferred format.
> 
> I believe this is a gloss.  Microsoft uses UTF-16.  Because
> the basic character unit is larger than one byte it is crucial
> for interoperability to prefix a string of UTF-16 text with an
> indication of the order of bytes in each two byte unit.  This
> is the role of the BOM.  The BOM is not part of the text.  It
> is a wrapper or envelope.
> 
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.  There is no MEANINGFUL definition
> of BOM in a UTF-8 string.  But instead of stripping the wrapper
> and converting only the text payload Microsoft lazily treats
> both the wrapper and its payload as text.

I have actually had a variant of this particular discussion with Walter
Dörwald.  He brought up RFC 3629...

[Walter Dörwald]
I don't think it does. RFC 3629 isn't that clear about whether an
initial 0xEF 0xBB 0xBF sequence is to be interpreted as an encoding
signature or a ZWNBSP. But I think the following part of RFC 3629
applies here for Python source code:

   o  A protocol SHOULD also forbid use of U+FEFF as a signature for
      those textual protocol elements for which the protocol provides
      character encoding identification mechanisms, when it is expected
      that implementations of the protocol will be in a position to
      always use the mechanisms properly.  This will be the case when
      the protocol elements are maintained tightly under the control of
      the implementation from the time of their creation to the time of
      their (properly labeled) transmission.

[My reply, slightly altered for this context]
Because not all tools that may manipulate data consumed and/or produced
by Python follow the coding: directive, "the protocol elements" are
not 'tightly maintained', so the inclusion of a "BOM" for utf-8 is a
necessary "protocol element", at least for .py files, and certainly
suggested for other file types that _may not have_ the equivalent of a
Python coding: directive.


Explicit is better than implicit, and in this case we have the
opportunity to be explicit about the "envelope" or "the protocol
elements", which will guarantee proper interpretation by non-braindead
software.  Braindead software that doesn't understand a utf-* BOM should
be fixed by the developer or eschewed.


> You can take this further and imagine concatenating two UTF-8
> strings, one originally UTF-16 generated in a little-endian
> environment, the other originally UTF-16 generated in a big-
> endian environment.  If the BOMs are not pre-stripped then
> during raising of the concatenated result to UTF-16 you will
> get an object with embedded BOMs.  This is not meaningful.

And is generally ignored, as per the Unicode spec; it's a "zero width
non-breaking space" - an invisible character with no effect on wrapping
or otherwise.

> What does it mean within a UTF-16 string to encounter a BOM
> that contradicts the wrapper/envelope?  Does this mean that
> any correct UTF-16 utility much cope with hybrid object whose
> byte order potentially changes mid-stride?

Unless you are doing something wrong (like literally concatenating the
byte representations of a utf-16be and utf-16le encoded text), this
won't happen.


> /john, who has written a database loader that has to contend
> with (and clearly diagnoses) BOM in UTF-8 strings.

Given that a BOM is only supposed to be treated as a BOM when it is
literally the first few bytes of a string, I certainly hope you didn't
spend too much time on that support.

 - Josiah (who has written an editor with support for all UTF variants
with BOM, and UTF-8 + all other localized encodings using coding:
directives)


From paul at prescod.net  Wed Sep 13 18:44:18 2006
From: paul at prescod.net (Paul Prescod)
Date: Wed, 13 Sep 2006 09:44:18 -0700
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
Message-ID: <1cb725390609130944k20c315c1k482ff1bd7cc5a85a@mail.gmail.com>

On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
>
> On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote:
>
> > UTF-8 with BOM is the Microsoft preferred format.
>
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.  There is no MEANINGFUL definition
> of BOM in a UTF-8 string.


That is not true.

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)?
If yes, then can I still assume the remaining UTF-8 bytes are in
big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes *no* difference as
to the endianness of the byte stream. UTF-8 always has the same byte
order. An initial BOM is *only* used as a signature -- an indication
that an otherwise unmarked text file is in UTF-8.

This is a very valuable function and applications like Microsoft's Notepad,
Apple's TextEdit and VIM take good advantage of it.

"""

Vim will try to detect what kind of file you are editing.  It uses the
encoding names in the 'fileencodings' option.  When using Unicode, the
default value is: "ucs-bom,utf-8,latin1".  This means that Vim checks the
file to see if it's one of these encodings:

	ucs-bom		File must start with a Byte Order Mark (BOM).  This
			allows detection of 16-bit, 32-bit and utf-8 Unicode
			encodings.
	utf-8		utf-8 Unicode.  This is rejected when a sequence of
			bytes is illegal in utf-8.
	latin1		The good old 8-bit encoding.
"""

I'm pretty much proposing this same algorithm for Python's encoding
guessing.
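
In rough (and untested) Python terms, the sketch would be something like
the following; guess_decode and the BOM table are just illustrative
names, and I've left out the 32-bit BOMs since Python currently has no
utf-32 codec:

    import codecs

    _BOMS = [
        (codecs.BOM_UTF8, 'utf-8'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
    ]

    def guess_decode(data):
        # 1. ucs-bom: the file starts with a Byte Order Mark
        for bom, name in _BOMS:
            if data.startswith(bom):
                return data[len(bom):].decode(name)
        # 2. utf-8: rejected when a byte sequence is illegal in utf-8
        try:
            return data.decode('utf-8')
        except UnicodeDecodeError:
            pass
        # 3. latin1: the good old 8-bit encoding - never fails
        return data.decode('latin-1')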

 Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060913/87ce70e9/attachment.html 

From jimjjewett at gmail.com  Wed Sep 13 19:09:27 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 13 Sep 2006 13:09:27 -0400
Subject: [Python-3000] string C API
In-Reply-To: <450820A3.4000302@v.loewis.de>
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>
	<450820A3.4000302@v.loewis.de>
Message-ID: <fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>

On 9/13/06, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> > Should encoding be an attribute of the string?

> No. A Python string is a sequence of Unicode characters.
> Even if it was created by converting from some other encoding,
> that original encoding gets lost when doing the conversion
> (just like integers don't remember which base they were originally
> represented in).

Theoretically, it is a sequence of code points.

Today, in Python 2.x, these are always represented by a specific
(wide, fixed-width) concrete encoding, chosen at compile time.  This
is required so long as outside code can access the data buffer
directly.

It would no longer be required if all access were through unicode
methods.  (And it would probably make sense to have a
"get-me-the-buffer-in-this-encoding" method.)

Several people seem to want more efficient representations when possible.

Several people seem to want UTF-8, which makes sense if the rest of
the system is UTF8, but complicates the implementation.

Simply not encoding/decoding until required would save quite a bit of
time and space -- but then the object would need some way of
indicating which encoding it is in.
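
A rough sketch of what I mean (all names invented, untested):

    class LazyText(object):
        """Bytes plus the encoding they claim to be in; decode on demand."""
        def __init__(self, raw, encoding):
            self._raw = raw            # original, undecoded bytes
            self._encoding = encoding  # the encoding they arrived in
            self._text = None          # cache for the decoded form
        def _decoded(self):
            if self._text is None:
                self._text = self._raw.decode(self._encoding)
            return self._text
        def __len__(self):
            return len(self._decoded())
        def buffer_in(self, encoding):
            # the "get-me-the-buffer-in-this-encoding" method
            if encoding == self._encoding:
                return self._raw       # no recoding needed
            return self._decoded().encode(encoding)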

-jJ

From martin at v.loewis.de  Wed Sep 13 19:14:30 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 19:14:30 +0200
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>	
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>	
	<450820A3.4000302@v.loewis.de>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
Message-ID: <45083C76.8010302@v.loewis.de>

Jim Jewett schrieb:
> Simply not encoding/decoding until required would save quite a bit of
> time and space -- but then the object would need some way of
> indicating which encoding it is in.

Try implementing that some time. You'll find it will be incredibly
complex and unmaintainable. Start with implementing len(s).

Regards,
Martin


From jimjjewett at gmail.com  Wed Sep 13 19:27:28 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 13 Sep 2006 13:27:28 -0400
Subject: [Python-3000] string C API
In-Reply-To: <45083C76.8010302@v.loewis.de>
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>
	<450820A3.4000302@v.loewis.de>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
Message-ID: <fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>

On 9/13/06, "Martin v. L?wis" <martin at v.loewis.de> wrote:
> Jim Jewett schrieb:
> > Simply not encoding/decoding until required would save quite a bit of
> > time and space -- but then the object would need some way of
> > indicating which encoding it is in.

> Try implementing that some time. You'll find it will be incredibly
> complex and unmaintainable. Start with implementing len(s).

Simply delegate such methods to a hidden per-encoding subclass.

The UTF-8 methods will indeed be complex, unless the solution is
simply "someone called indexing/slicing/len, so I have to recode after
all."

The Latin-1 encoding will have no such problem.

-jJ

From guido at python.org  Wed Sep 13 20:06:05 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 13 Sep 2006 11:06:05 -0700
Subject: [Python-3000] sys.stdin and sys.stdout with textfile
In-Reply-To: <45062AD2.1090207@canterbury.ac.nz>
References: <1157898432.4246.161.camel@fsol>
	<ca471dc20609101004t2d55b686x4908c39981467106@mail.gmail.com>
	<45062AD2.1090207@canterbury.ac.nz>
Message-ID: <ca471dc20609131106n23158790pc1255ddb97abaf0c@mail.gmail.com>

On 9/11/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Guido van Rossum wrote:
>
> > All sorts of things are different when reading stdin vs. opening a
> > filename. e.g. stdin may be a pipe.
>
> Which suggests that if anything is going to try
> to guess the encoding, it would be better for it
> to start reading from the actual stream you're
> going to use and buffer the result, rather than
> rely on being able to open it separately.

Right. The filename is useless. The stream may or may not be seekable
(sometimes even stdin is!). Having a buffering layer in between would
make it possible to peek ahead in the buffer.
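
Something along these lines (an illustrative sketch only; the class name
is made up and this isn't a proposal for the real I/O stack):

    class PeekableStream(object):
        """Wrap a possibly unseekable byte stream and allow peeking ahead."""
        def __init__(self, fileobj):
            self._f = fileobj
            self._buf = ''
        def peek(self, n):
            # Read ahead without consuming, even if the stream can't seek.
            while len(self._buf) < n:
                chunk = self._f.read(n - len(self._buf))
                if not chunk:
                    break
                self._buf += chunk
            return self._buf[:n]
        def read(self, n):
            data = self._buf[:n]
            self._buf = self._buf[n:]
            if len(data) < n:
                data += self._f.read(n - len(data))
            return data

An encoding guesser could then peek() at the first few bytes to sniff a
BOM before any real reads happen.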

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From martin at v.loewis.de  Wed Sep 13 20:09:42 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 13 Sep 2006 20:09:42 +0200
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>	
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>	
	<450820A3.4000302@v.loewis.de>	
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>	
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
Message-ID: <45084966.3000608@v.loewis.de>

Jim Jewett schrieb:
> Simply delegate such methods to a hidden per-encoding subclass.
> 
> The UTF-8 methods will indeed be complex, unless the solution is
> simply "someone called indexing/slicing/len, so I have to recode after
> all."
> 
> The Latin-1 encoding will have no such problem.

I'm not so much worried about UTF-8 or Latin-1; they are fairly trivial.
Such methods would be dramatically slow for other multi-byte encodings.

Regards,
Martin

From jason.orendorff at gmail.com  Wed Sep 13 20:23:33 2006
From: jason.orendorff at gmail.com (Jason Orendorff)
Date: Wed, 13 Sep 2006 14:23:33 -0400
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
Message-ID: <bb8868b90609131123k649f0eaay9f17be9674fefde@mail.gmail.com>

On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.

John, you're mistaken about the reason this BOM is here.

In Notepad at least, the BOM is intentionally generated when writing
the file.  It's not a "mistake" or "laziness".  It's metadata.  (I
admit the BOM was not originally invented for this purpose.)

> There is no MEANINGFUL definition of BOM in a UTF-8
> string.

This thread is about files, not strings.  At the start of a file, a
UTF-8 BOM is meaningful.  It means the file is UTF-8.

On Windows, there's a system default encoding, and it's never UTF-8.
Notepad writes the BOM so that later, when you open the file in
Notepad again, it can identify the file as UTF-8.

> You can see the logical fallacy if you imagine emitting UTF-16
> text in an environment of one byte sex, reducing that text to
> UTF-8, carrying it to an environment of the other byte sex and
> raising it back to UTF-16.

It sounds as if you think this will corrupt the BOM, but it works fine:

 >>> import codecs
 # "Emitting UTF-16 text" in little-endian environment
 >>> s1 = codecs.BOM_UTF16_LE + u'hello world'.encode('utf-16-le')
 # "Reducing that text to UTF-8"
 >>> s2 = s1.decode('utf-16-le').encode('utf-8')
 >>> s2
 '\xef\xbb\xbfhello world'
 # "Raising it back to UTF-16" in big-endian environment
 >>> s3 = s2.decode('utf-8').encode('utf-16-be')
 >>> s3[:2] == codecs.BOM_UTF16_BE
 True

The BOM is still correct: the data is UTF-16-BE, and the BOM agrees.

A UTF-8 string or file will contain exactly the same bytes (including
the BOM, if any) whether it is generated from UTF-16-BE or -LE.  All
three are lossless representations in bytes of the same abstract
ideal, which is a sequence of Unicode codepoints.

-j

From rasky at develer.com  Wed Sep 13 22:09:38 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Wed, 13 Sep 2006 22:09:38 +0200
Subject: [Python-3000] educational aspects of Python 3000
References: <20060911112215.odruja6fdgl0kcg4@login.werra.lunarpages.com><4506351D.4040109@canterbury.ac.nz>
	<740c3aec0609121556x12586796wa0888b8284eb94f@mail.gmail.com>
Message-ID: <05df01c6d770$8a97c8f0$43492597@bagio>

Björn Lindqvist <bjourne at gmail.com> wrote:

>>> The idea of a standard edu library though is a GREAT one.
>>> [...]

>> I disagree for two reasons:
>>
>> 1) Even a single line of boilerplate is too much
>> when you're trying to pare things down to the
>> bare minimum for a beginner.
>>
>> 2) It teaches a bad habit right from the
>> beginning (i.e. using 'import *'). This is the
>> wrong foot to start a beginner off on.
>
> I agree. For an absolute newbie, Python's import semantics are way, WAY
> down the road, long after variables, numbers, strings, comments,
> control statements, functions etc. A third reason is that if these
> functions are packaged in a beginnerlib module, then you would have to
> type "from beginnerlib import *" each and every time you want to use
> raw_input() from the Python console.

Another solution would be to have a special "python --edu" command line
option which automatically star-imports the beginnerlib before the
interactive mode starts. Or a PYTHONEDU=1 env var. Or a custom site.py
which patches __builtins__.
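
For the site.py route, a minimal sketch (beginnerlib is hypothetical, as
in this thread):

    # sitecustomize.py - a sketch; beginnerlib is the hypothetical edu module
    import __builtin__
    import beginnerlib
    for _name in getattr(beginnerlib, '__all__', []):
        setattr(__builtin__, _name, getattr(beginnerlib, _name))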

Giovanni Bajo


From solipsis at pitrou.net  Wed Sep 13 22:33:22 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Wed, 13 Sep 2006 22:33:22 +0200
Subject: [Python-3000] BOM handling
In-Reply-To: <20060913092509.F933.JCARLSON@uci.edu>
References: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
	<20060913092509.F933.JCARLSON@uci.edu>
Message-ID: <1158179602.4721.24.camel@fsol>


On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
> And is generally ignored, as per unicode spec; it's a "zero width
> non-breaking space" - an invisible character with no effect on wrapping
> or otherwise.

Well it would be better if Py3K (with all strings unicode) makes things
easy for the programmer and abstracts away those "invisible characters
with no textual meaning". Currently it's not the case:

>>> a = "hello".decode("utf-8")
>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>>> len(a)
5
>>> len(b)
6
>>> a == b
False

>>> a = "hello".encode("utf-16le").decode("utf-16le")
>>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
>>> len(a)
5
>>> len(b)
6
>>> a == b
False
>>> a
u'hello'
>>> b
u'\ufeffhello'
>>> print a
hello
>>> print b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/encodings/iso8859_15.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>


Regards

Antoine.



From g.brandl at gmx.net  Wed Sep 13 22:45:27 2006
From: g.brandl at gmx.net (Georg Brandl)
Date: Wed, 13 Sep 2006 22:45:27 +0200
Subject: [Python-3000] BOM handling
In-Reply-To: <1158179602.4721.24.camel@fsol>
References: <1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>	<20060913092509.F933.JCARLSON@uci.edu>
	<1158179602.4721.24.camel@fsol>
Message-ID: <ee9ql8$6e2$1@sea.gmane.org>

Antoine Pitrou wrote:
> On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
>> And is generally ignored, as per unicode spec; it's a "zero width
>> non-breaking space" - an invisible character with no effect on wrapping
>> or otherwise.
> 
> Well it would be better if Py3K (with all strings unicode) makes things
> easy for the programmer and abstracts away those "invisible characters
> with no textual meaning". Currently it's not the case:
> 
>>>> a = "hello".decode("utf-8")
>>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>>>> len(a)
> 5
>>>> len(b)
> 6
>>>> a == b
> False

This behavior is questionable...

>>>> a = "hello".encode("utf-16le").decode("utf-16le")
>>>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
>>>> len(a)
> 5
>>>> len(b)
> 6

... while this is IMHO not. UTF-16LE does not have a BOM as byte order is already
specified by the encoding. The correct example is

b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16")

b then equals u"hello", as it should.

"hello".encode("utf-16") prepends a BOM itself.

Georg


From walter at livinglogic.de  Thu Sep 14 00:05:31 2006
From: walter at livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=)
Date: Thu, 14 Sep 2006 00:05:31 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <bb8868b90609131123k649f0eaay9f17be9674fefde@mail.gmail.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>	<87r6yij7ea.fsf@qrnik.zagroda>	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
	<bb8868b90609131123k649f0eaay9f17be9674fefde@mail.gmail.com>
Message-ID: <450880AB.1040104@livinglogic.de>

Jason Orendorff wrote:
> On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
>> It is a mistake on Microsoft's part to fail to strip the BOM
>> during conversion to UTF-8.
> 
> John, you're mistaken about the reason this BOM is here.
> 
> In Notepad at least, the BOM is intentionally generated when writing
> the file.  It's not a "mistake" or "laziness".  It's metadata.  (I
> admit the BOM was not originally invented for this purpose.)

In theory it's only metadata if external information says that it is; in 
practice it's unlikely that a charmap-encoded file begins with these 
three bytes. Nevertheless it's only a hint.

>> There is no MEANINGFUL definition of BOM in a UTF-8
>> string.
> 
> This thread is about files, not strings.  At the start of a file, a
> UTF-8 BOM is meaningful.  It means the file is UTF-8.

... and the first "character" in the file is U+FEFF. If you want the 
codec to drop the BOM on reading, use the UTF-8-Sig codec.

> [...]

Servus,
    Walter


From jcarlson at uci.edu  Thu Sep 14 01:14:29 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 13 Sep 2006 16:14:29 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <1158179602.4721.24.camel@fsol>
References: <20060913092509.F933.JCARLSON@uci.edu>
	<1158179602.4721.24.camel@fsol>
Message-ID: <20060913153900.F936.JCARLSON@uci.edu>


Antoine Pitrou <solipsis at pitrou.net> wrote:
> 
> 
> On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
> > And is generally ignored, as per unicode spec; it's a "zero width
> > non-breaking space" - an invisible character with no effect on wrapping
> > or otherwise.
> 
> Well it would be better if Py3K (with all strings unicode) makes things
> easy for the programmer and abstracts away those "invisible characters
> with no textual meaning". Currently it's not the case:

> >>> a = "hello".decode("utf-8")
> >>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
> >>> len(a)
> 5
> >>> len(b)
> 6
> >>> a == b
> False

I had also had this particular discussion with another individual
previously (but I can't seem to find it in my archive), and one point
brought up was that apparently Python 2.5 was supposed to have a variant
codec for utf-8 that automatically stripped at most one \ufeff character
from the beginning of decoded output and added it during encoding,
similar to how the generic 'utf-16' and 'utf-32' codecs add and strip:

>>> u'hello'.encode('utf-16')
'\xff\xfeh\x00e\x00l\x00l\x00o\x00'
>>> len(u'hello'.encode('utf-16').decode('utf-16'))
5
>>> 

I'm unable to find that particular utf-8 codec in the version of Python
2.5 I have installed, but I may not be looking in the right places, or
spelling it the right way.

In any case, I believe that the above behavior is correct for the
context.  Why?  Because utf-8 has no endianness, its 'generic' decoding
spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
'utf-16-le' decoding spellings, two of which don't strip.


> >>> a = "hello".encode("utf-16le").decode("utf-16le")
> >>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
> >>> len(a)
> 5
> >>> len(b)
> 6
> >>> a == b
> False

Georg Brandl responded to this example already.


> >>> a
> u'hello'
> >>> b
> u'\ufeffhello'
> >>> print a
> hello
> >>> print b
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "/usr/lib/python2.4/encodings/iso8859_15.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

There are two answers to this particular "problem".  Either that is
expected and desirable behavior for all non-utf encodings, or all
non-utf encodings need to gain a mapping of the feff code point to the
empty string.  I think the behavior is expected and desirable.  Why?
Because none of the non-utf encodings have a valid and round-trip-able
representation for the feff code point.

Also, if you want to print possibly arbitrary unicode strings to the
console, you may consider encoding the unicode string first, offering
either 'ignore' or 'replace' as the second argument.
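
For example, with 'ignore' the unencodable BOM is simply dropped:

>>> u'\ufeffhello'.encode('iso8859-15', 'ignore')
'hello'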

 - Josiah


From david.nospam.hopwood at blueyonder.co.uk  Thu Sep 14 01:36:50 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Thu, 14 Sep 2006 00:36:50 +0100
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <bb8868b90609131123k649f0eaay9f17be9674fefde@mail.gmail.com>
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>	<87r6yij7ea.fsf@qrnik.zagroda>	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
	<bb8868b90609131123k649f0eaay9f17be9674fefde@mail.gmail.com>
Message-ID: <45089612.4060007@blueyonder.co.uk>

Jason Orendorff wrote:
> On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
> 
>>It is a mistake on Microsoft's part to fail to strip the BOM
>>during conversion to UTF-8.
> 
> John, you're mistaken about the reason this BOM is here.
> 
> In Notepad at least, the BOM is intentionally generated when writing
> the file.  It's not a "mistake" or "laziness".  It's metadata.  (I
> admit the BOM was not originally invented for this purpose.)
> 
>>There is no MEANINGFUL definition of BOM in a UTF-8
>>string.
> 
> This thread is about files, not strings.  At the start of a file, a
> UTF-8 BOM is meaningful.  It means the file is UTF-8.
> 
> On Windows, there's a system default encoding, and it's never UTF-8.

The Windows system encoding can be UTF-8, but only for some locales
recently added in Windows 2000/XP, where there was no compatibility
constraint to use a non-Unicode encoding.

You're correct about the use of a BOM as a signature. All Unicode-conformant
applications should accept this use of a BOM in UTF-8 (although they need
not generate it); the standard is quite clear on that.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From solipsis at pitrou.net  Thu Sep 14 08:19:00 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Thu, 14 Sep 2006 08:19:00 +0200
Subject: [Python-3000] BOM handling
In-Reply-To: <20060913153900.F936.JCARLSON@uci.edu>
References: <20060913092509.F933.JCARLSON@uci.edu>
	<1158179602.4721.24.camel@fsol> <20060913153900.F936.JCARLSON@uci.edu>
Message-ID: <1158214740.5863.19.camel@fsol>


Hi,

On Wednesday 13 September 2006 at 16:14 -0700, Josiah Carlson wrote:
> In any case, I believe that the above behavior is correct for the
> context.  Why?  Because utf-8 has no endianness, its 'generic' decoding
> spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
> 'utf-16-le' decoding spellings, two of which don't strip.

Your opinion is probably valid from a theoretical point of view. You are
more knowledgeable than I am.

My point was different: most programmers are not at your level (or
Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type
is supposed to be an abstracted textual type to make it easy to write
unicode-friendly applications (isn't it?).
Therefore it should hide the messy issue of superfluous BOMs, unwanted
BOMs, etc. Telling the programmer to use a specific UTF-8 variant
specialized in BOM-stripping will make eyes roll... "why doesn't the
standard UTF-8 do it for me?"

Regards

Antoine.



From walter at livinglogic.de  Thu Sep 14 09:12:21 2006
From: walter at livinglogic.de (=?ISO-8859-1?Q?Walter_D=F6rwald?=)
Date: Thu, 14 Sep 2006 09:12:21 +0200
Subject: [Python-3000] BOM handling
In-Reply-To: <20060913153900.F936.JCARLSON@uci.edu>
References: <20060913092509.F933.JCARLSON@uci.edu>	<1158179602.4721.24.camel@fsol>
	<20060913153900.F936.JCARLSON@uci.edu>
Message-ID: <450900D5.6050606@livinglogic.de>

Josiah Carlson wrote:
> Antoine Pitrou <solipsis at pitrou.net> wrote:
>>
>> On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
>>> And is generally ignored, as per unicode spec; it's a "zero width
>>> non-breaking space" - an invisible character with no effect on wrapping
>>> or otherwise.
>> Well it would be better if Py3K (with all strings unicode) makes things
>> easy for the programmer and abstracts away those "invisible characters
>> with no textual meaning". Currently it's not the case:
> 
>>>>> a = "hello".decode("utf-8")
>>>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>>>>> len(a)
>> 5
>>>>> len(b)
>> 6
>>>>> a == b
>> False
> 
> I had also had this particular discussion with another individual
> previously (but I can't seem to find it in my archive), and one point
> brought up was that apparently Python 2.5 was supposed to have a variant
> codec for utf-8 that automatically stripped at most one \ufeff character
> from the beginning of decoded output and added it during encoding,
> similar to how the generic 'utf-16' and 'utf-32' codecs add and strip:
> 
>>>> u'hello'.encode('utf-16')
> '\xff\xfeh\x00e\x00l\x00l\x00o\x00'
>>>> len(u'hello'.encode('utf-16').decode('utf-16'))
> 5
> 
> I'm unable to find that particular utf-8 codec in the version of Python
> 2.5 I have installed, but I may not be looking in the right places, or
> spelling it the right way.

It's called "utf-8-sig".
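
For example, it strips at most one BOM on decoding and prepends one on
encoding:

>>> import codecs
>>> (codecs.BOM_UTF8 + 'hello').decode('utf-8-sig')
u'hello'
>>> u'hello'.encode('utf-8-sig')
'\xef\xbb\xbfhello'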

> In any case, I believe that the above behavior is correct for the
> context.  Why?  Because utf-8 has no endianness, its 'generic' decoding
> spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
> 'utf-16-le' decoding spellings, two of which don't strip.

Servus,
    Walter


From talin at acm.org  Thu Sep 14 10:04:33 2006
From: talin at acm.org (Talin)
Date: Thu, 14 Sep 2006 01:04:33 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <1158214740.5863.19.camel@fsol>
References: <20060913092509.F933.JCARLSON@uci.edu>	<1158179602.4721.24.camel@fsol>
	<20060913153900.F936.JCARLSON@uci.edu>
	<1158214740.5863.19.camel@fsol>
Message-ID: <45090D11.3060908@acm.org>

Antoine Pitrou wrote:
> Hi,
> 
> On Wednesday 13 September 2006 at 16:14 -0700, Josiah Carlson wrote:
>> In any case, I believe that the above behavior is correct for the
>> context.  Why?  Because utf-8 has no endianness, its 'generic' decoding
>> spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
>> 'utf-16-le' decoding spellings, two of which don't strip.
> 
> Your opinion is probably valid from a theoretical point of view. You are
> more knowledgeable than I am.
> 
> My point was different: most programmers are not at your level (or
> Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type
> is supposed to be an abstracted textual type to make it easy to write
> unicode-friendly applications (isn't it?).
> Therefore it should hide the messy issue of superfluous BOMs, unwanted
> BOMs, etc. Telling the programmer to use a specific UTF-8 variant
> specialized in BOM-stripping will make eyes roll... "why doesn't the
> standard UTF-8 do it for me?"

I've been reading this thread (and the ones that spawned it), and 
there's something about it that's been nagging at me for a while, which 
I am going to attempt to articulate.

The basic controversy centers around the various ways in which Python 
should attempt to deal with character encodings on various platforms, 
but my question is "for what use cases?" To my mind, trying to ask "how 
should we handle character encoding" without indicating what we want to 
use the characters *for* is a meaningless question.

 From the standpoint of a programmer writing code to process file 
contents, there's really no such thing as a "text file" - there are only 
various text-based file formats. There are XML files, .ini files, email 
messages and Python source code, all of which need to be processed 
differently.

So when one asks "how do I handle text files", my response is "there 
ain't no such thing" -- and when you ask "well, ok, how do I handle 
text-based file formats", my response is "well it depends on the format".

Yes, there are some operations which can operate on textual data 
regardless of file format (e.g. grep), but these generic operations are 
so basic and uninteresting that one generally doesn't need to write 
Python code to do them. And even in the case of simple unix utilities 
such as 'cat', *some* a priori knowledge of the file's encoded meaning 
is required - you can't just concatenate two XML files and get anything 
meaningful or valid. Running 'sort' on Python source code is unlikely to 
increase shareholder value or otherwise hold back the tide of entropy.

Any given Python program that I write is going to know *something* about 
the format of the files that it is supposed to read/write, and the most 
important consideration is knowledge of what kinds of other programs are 
going to produce or consume that file. If the file that I am working 
with conforms to a standard (so that the number of producer/consumer 
programs can be large without me having to know the specific details of 
each one) then I need to understand that standard and constraints of 
what is legal within it.

For files with any kind of structure in them, common practice is that we 
don't treat them as streams of characters, rather we generally have some 
abstraction layer that sits on top of the character stream and allows us 
to work with the structure directly. Thus, when dealing with XML one 
generally uses something like ElementTree, and in fact manipulating XML 
files as straight text is actively discouraged.

So my whole approach to the problem of reading and writing is to come up 
with a collection of APIs that reflect the common use patterns for the 
various popular file types. The benefit of doing this is that you don't 
waste time thinking about all of the various file operations that don't 
apply to a particular file format. For example, using the ElementTree 
interface, I don't care whether the underlying file stream supports 
seek() or not - generally one doesn't seek into the middle of an XML, so 
there's no need to support that feature. On the other hand, if one is 
reading a bdb file, one needs to seek to the location of a record in 
order to read it - but in such a case, the result of the seek operation 
is well-defined. I don't have to spend time discussing what will happen 
if I seek into the middle of an encoded multi-byte character, because 
with a bdb file, that can't happen.

It seems to me that a lot of the conundrums that have been discussed in 
this thread have to do with hypothetical use cases - 'Well, what if I 
use operation X on a file of format Y, for which the result is 
undefined?' My answer is "Don't do that."

-- Talin

From ncoghlan at gmail.com  Thu Sep 14 12:19:46 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 14 Sep 2006 20:19:46 +1000
Subject: [Python-3000] string C API
In-Reply-To: <45084966.3000608@v.loewis.de>
References: <ed968c$d03$1@sea.gmane.org>
	<45078B46.90408@v.loewis.de>		<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>		<450820A3.4000302@v.loewis.de>		<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>		<45083C76.8010302@v.loewis.de>	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de>
Message-ID: <45092CC2.4070700@gmail.com>

Martin v. Löwis wrote:
> Jim Jewett schrieb:
>> Simply delegate such methods to a hidden per-encoding subclass.
>>
>> The UTF-8 methods will indeed be complex, unless the solution is
>> simply "someone called indexing/slicing/len, so I have to recode after
>> all."
>>
>> The Latin-1 encoding will have no such problem.
> 
> I'm not so much worried about UTF-8 or Latin-1; they are fairly trivial.
> Efficiency of such methods for multi-byte encodings would be
> dramatically slow.

Only the first such call on a given string, though - the idea is to use lazy 
decoding, not to avoid decoding altogether. Most manipulations (len, indexing, 
slicing, concatenation, etc) would require decoding to at least UCS-2 (or 
perhaps UCS-4).

It's applications that are just schlepping bits around that would benefit from 
the lazy decoding behaviour.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From qrczak at knm.org.pl  Thu Sep 14 14:44:28 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 14 Sep 2006 14:44:28 +0200
Subject: [Python-3000] string C API
In-Reply-To: <45092CC2.4070700@gmail.com> (Nick Coghlan's message of "Thu,
	14 Sep 2006 20:19:46 +1000")
References: <ed968c$d03$1@sea.gmane.org> <45078B46.90408@v.loewis.de>
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>
	<450820A3.4000302@v.loewis.de>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
Message-ID: <8764fq8vo3.fsf@qrnik.zagroda>

Nick Coghlan <ncoghlan at gmail.com> writes:

> Only the first such call on a given string, though - the idea
> is to use lazy decoding, not to avoid decoding altogether.
> Most manipulations (len, indexing, slicing, concatenation, etc)
> would require decoding to at least UCS-2 (or perhaps UCS-4).

Silently optimizing string recoding might change the way recoding
errors are reported. i.e. they might not be reported at all even
if the string is malformed. Optimizations which change the semantics
are bad.

I imagine only a few cases where lazy decoding would be beneficial:

1. A whole input stream is copied to an output stream which uses the
   same encoding.

   Here the application might choose to copy binary streams instead.

2. A file name, user name, or similar token is obtained from the OS
   in one place and used in another place. Especially on Unix where
   they use byte encodings (Windows prefers UTF-16).

   These cases can be optimized by other means:

   - Sometimes representing the token as a Python string can be
     avoided. For example executing an action in a different directory
     and then returning to the original directory might choose to
     represent the saved directory as a byte array.

   - Under the assumption that the system encoding is ASCII-compatible,
     calling the recoding machinery can be omitted for ASCII-only strings.
     This applies only to strings exchanged with the OS etc., not to
     stream contents which can use non-ASCII-compatible encodings.

My language implementation has only two string representations:
ISO-8859-1 and UTF-32 (the narrow representation is used for all
strings where it's possible). This is completely transparent to the
high level semantics, like the fixnum/bignum split. I'm happy with
this choice.

My text I/O buffers and recoding buffers use UTF-32 exclusively.
It would be too complicated to try to use a narrow representation
when the string is not processed as a whole. This makes the ASCII-only
optimization significant I believe (but I haven't measured it).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From solipsis at pitrou.net  Thu Sep 14 14:48:56 2006
From: solipsis at pitrou.net (Antoine)
Date: Thu, 14 Sep 2006 14:48:56 +0200 (CEST)
Subject: [Python-3000] string C API
In-Reply-To: <45092CC2.4070700@gmail.com>
References: <ed968c$d03$1@sea.gmane.org>
	<45078B46.90408@v.loewis.de>		<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>		<450820A3.4000302@v.loewis.de>		<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>		<45083C76.8010302@v.loewis.de>	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
Message-ID: <51134.62.39.9.251.1158238136.squirrel@webmail.nerim.net>


> Only the first such call on a given string, though - the idea is to use
> lazy
> decoding, not to avoid decoding altogether. Most manipulations (len,
> indexing,
> slicing, concatenation, etc) would require decoding to at least UCS-2 (or
> perhaps UCS-4).

My two cents:

For len() you can compute the length at string construction and store it
in the string object (which is immutable). For example if the string is
constructed by concatenation then computing the resulting length should be
trivial. Even when real computation is needed, it plays nicer with the CPU
cache since the data has to be there anyway.

As for concatenation, recoding can be avoided if the strings to be
concatenated use the same internal encoding (assuming it does not hold
internal state). Given that in many cases the strings will come from
similar sources (thus use the same internal encoding), it may be an
interesting optimization.
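
A tiny sketch of the idea (purely illustrative):

    class EncodedString(object):
        def __init__(self, raw, encoding, length):
            self.raw = raw            # bytes in the internal encoding
            self.encoding = encoding
            self.length = length      # character count, computed once
        def __len__(self):
            return self.length        # O(1), no decoding needed
        def __add__(self, other):
            if other.encoding == self.encoding:
                # same internal encoding: no recoding, length is trivial
                return EncodedString(self.raw + other.raw, self.encoding,
                                     self.length + other.length)
            raise NotImplementedError("recode one operand first")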

Regards

Antoine.



From p.f.moore at gmail.com  Thu Sep 14 14:50:49 2006
From: p.f.moore at gmail.com (Paul Moore)
Date: Thu, 14 Sep 2006 13:50:49 +0100
Subject: [Python-3000] BOM handling
In-Reply-To: <45090D11.3060908@acm.org>
References: <20060913092509.F933.JCARLSON@uci.edu>
	<1158179602.4721.24.camel@fsol> <20060913153900.F936.JCARLSON@uci.edu>
	<1158214740.5863.19.camel@fsol> <45090D11.3060908@acm.org>
Message-ID: <79990c6b0609140550j287792ex468ff93407a6d4ac@mail.gmail.com>

On 9/14/06, Talin <talin at acm.org> wrote:
> I've been reading this thread (and the ones that spawned it), and
> there's something about it that's been nagging at me for a while, which
> I am going to attempt to articulate.
[...]
> Any given Python program that I write is going to know *something* about
> the format of the files that it is supposed to read/write, and the most
> important consideration is knowledge of what kinds of other programs are
> going to produce or consume that file. If the file that I am working
> with conforms to a standard (so that the number of producer/consumer
> programs can be large without me having to know the specific details of
> each one) then I need to understand that standard and constraints of
> what is legal within it.

Well said!

There *is* still an issue, which is that Python needs to supply tools
to cater for naive users writing naive programs to parse/produce
ad-hoc text based file formats. For example, someone sent me this file
of data, and I want to parse it and convert it into some other format
(load it into a database, generate XML, whaterver). In my experience,
in these cases:

1. Nobody tells me the character encoding used.
2. 99.9% of the data is ASCII - so there's very little basis for guessing.
3. The whole process isn't an exact science - I *expect* to have to do
a bit of manual tidying up.

Or it's all ASCII and it *really* doesn't matter.

Those are the bulk of my use cases. For them, I'd be happy with the
"system code page" (even though Windows has two, one for console and
one for GUI, that wouldn't bother me if it was visible to me). I
wouldn't mind UTF-8, or latin-1, or anything much. It's only that 0.1%
of cases where I expect to need to check and possibly intervene, so no
problem.

On the other hand, getting an error *would* bother me. In Python 2.x,
I get no error because I don't convert to Unicode. In Python 3.x, I
fear that I might, because someone expects me to care about that 0.1%.
And no, it's not good enough for me to be able to set a global option
- that's boilerplate I'd rather do without.

Parochially y'rs
Paul.

From qrczak at knm.org.pl  Thu Sep 14 15:01:23 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 14 Sep 2006 15:01:23 +0200
Subject: [Python-3000] Pre-PEP: Easy Text File Decoding
In-Reply-To: <45089612.4060007@blueyonder.co.uk> (David Hopwood's message of
	"Thu, 14 Sep 2006 00:36:50 +0100")
References: <20060911094447.hjabms9i7sgs4g8s@login.werra.lunarpages.com>
	<79990c6b0609111449j1ae99757m35037b27c5727eda@mail.gmail.com>
	<1cb725390609111530o3f7b0faah7d87a8cdf532cf39@mail.gmail.com>
	<87r6yij7ea.fsf@qrnik.zagroda>
	<1cb725390609111816k7187a15eu91c4029dad65885d@mail.gmail.com>
	<u1vfg2dnld13h9qh9pm62c4k032ivt7kic@4ax.com>
	<bb8868b90609131123k649f0eaay9f17be9674fefde@mail.gmail.com>
	<45089612.4060007@blueyonder.co.uk>
Message-ID: <871wqe8uvw.fsf@qrnik.zagroda>

David Hopwood <david.nospam.hopwood at blueyonder.co.uk> writes:

> You're correct about the use of a BOM as a signature. All
> Unicode-conformant applications should accept this use of a BOM in
> UTF-8 (although they need not generate it); the standard is quite
> clear on that.

When a program generates a list of filenames in a file, and I do
   xargs -i cp {} some-dir/ <filenames-file
and one file is not found because a UTF-8 BOM has been inserted before
its name, I won't blame xargs. I will blame the program which generated
the filenames. Or the language it is written in, if it didn't create
the BOM explicitly.

                          *       *       *

A tricky issue is handling filenames which can't be decoded.

I'm willing to blame myself when the list of filenames contains names
which can't be decoded using the locale encoding, because I know no
good solution to the problem of representing arbitrary Linux filenames
as Unicode strings.

Some people would blame the program or the language.

OTOH there exist libraries which believe that all filenames should be
UTF-8, irrespective of the locale. In particular Gnome used to require
setting the environment variable G_BROKEN_FILENAMES when filenames are
not UTF-8 (now G_FILENAME_ENCODING can be set). I disagree with them.

This applies to Linux. I think MacOS uses UTF-8 filenames, so the
story is different there.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From bwinton at latte.ca  Thu Sep 14 15:13:19 2006
From: bwinton at latte.ca (Blake Winton)
Date: Thu, 14 Sep 2006 09:13:19 -0400
Subject: [Python-3000] BOM handling
In-Reply-To: <45090D11.3060908@acm.org>
References: <20060913092509.F933.JCARLSON@uci.edu>	<1158179602.4721.24.camel@fsol>	<20060913153900.F936.JCARLSON@uci.edu>	<1158214740.5863.19.camel@fsol>
	<45090D11.3060908@acm.org>
Message-ID: <4509556F.4030508@latte.ca>

Talin wrote:
>> My point was different: most programmers are not at your level (or
>> Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type
>> is supposed to be an abstracted textual type to make it easy to write
>> unicode-friendly applications (isn't it?).
> 
> The basic controversy centers around the various ways in which Python 
> should attempt to deal with character encodings on various platforms, 
> but my question is "for what use cases?" To my mind, trying to ask "how 
> should we handle character encoding" without indicating what we want to 
> use the characters *for* is a meaningless question.

Contrary to all expectations, this thread has helped me in my day job 
already.  I'm about to start writing a program (in Python, natch) which 
will take a set of files, and perform simple token substitution on them, 
replacing tokens of the form %STUFF.format% with the value of the STUFF 
token looked up in another (XML, thus Unicode by the time it gets to me) 
file.

The files I'll be substituting in will be in various encodings, and I'll 
be creating new files which must have the same encoding.  Sadly, I don't 
know what all the encodings are.  (The Windows Resource Compiler takes 
in .rc files, but I can't find any suggestion of what encoding those 
use.  Anyone here know?)

The first version of the spec naively mentioned nothing about encodings, 
and so I raised a red flag about that, seeing that we would have 
problems, and that the right thing to do in this case isn't clear.

Um, what more data do we need for this use-case?  I'm not going to 
suggest an API, other than it would be nice if I didn't have to manually 
figure out/hard code all the encodings.  (It's my belief that I will 
currently have to do that, or at least special-case XML, to read the 
encoding attribute.)  Oh, and it would be particularly horrible if I 
output a shell script in UTF-8, and it included the BOM, since I believe 
that would break the "magic number" of "#!".

(To test it in vim, set the following options:
:set encoding=utf-8
:set bomb
)

Jennifer:~ bwinton$ xxd test
0000000: efbb bf23 2120 2f62 696e 2f62 6173 680a  ...#! /bin/bash.
0000010: 6563 686f 204a 7573 7420 7465 7374 696e  echo Just testin
0000020: 672e 2e2e 0a                             g....
Jennifer:~ bwinton$ ./test
-bash: ./test: cannot execute binary file

Jennifer:~ bwinton$ xxd test
0000000: 2321 202f 6269 6e2f 6261 7368 0a65 6368  #! /bin/bash.ech
0000010: 6f20 4a75 7374 2074 6573 7469 6e67 2e2e  o Just testing..
0000020: 2e0a                                     ..
Jennifer:~ bwinton$ ./test
Just testing...

>  From the standpoint of a programmer writing code to process file 
> contents, there's really no such thing as a "text file" - there are only 
> various text-based file formats. There are XML files, .ini files, email 
> messages and Python source code, all of which need to be processed 
> differently.

Yeah, see, at a business level, I really need to process those all in 
the same way, and it would be annoying to have to write code to handle 
them all differently.

> For files with any kind of structure in them, common practice is that we 
> don't treat them as streams of characters, rather we generally have some 
> abstraction layer that sits on top of the character stream and allows us 
> to work with the structure directly.

Your common practice, perhaps.  I find myself treating them as streams 
of characters as often as not, because I neither need nor care to 
process the structure.  Heck, even in my source code, I grep more often 
than I use the fancy "Find Usages" button (if only because PyDev in 
Eclipse doesn't let me search for all the usages of a function).

> So my whole approach to the problem of reading and writing is to come up 
> with a collection of APIs that reflect the common use patterns for the 
> various popular file types.

That sounds great.  Can you also come up with an API for the files that 
you don't consider to be in common use?  And if so, that's the one that 
everyone is going to use.  (I'm not saying that to be contrary, but 
because I honestly believe that that's what's going to happen.  If 
there's a choice between using one API for all your files, and using n 
APIs for all your files, my money is always going to be on the one. 
Maybe XML will have enough traction to make it two, but certainly no 
more than that.)

Later,
Blake.

From jcarlson at uci.edu  Thu Sep 14 18:28:39 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 09:28:39 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <4509556F.4030508@latte.ca>
References: <45090D11.3060908@acm.org> <4509556F.4030508@latte.ca>
Message-ID: <20060914092020.F954.JCARLSON@uci.edu>


Blake Winton <bwinton at latte.ca> wrote:
[snip]
> Um, what more data do we need for this use-case?  I'm not going to 
> suggest an API, other than it would be nice if I didn't have to manually 
> figure out/hard code all the encodings.  (It's my belief that I will 
> currently have to do that, or at least special-case XML, to read the 
> encoding attribute.)  Oh, and it would be particularly horrible if I 
> output a shell script in UTF-8, and it included the BOM, since I believe 
> that would break the "magic number" of "#!".

Use the XML declaration <?xml ... encoding="..." ?> to discover the
encoding, and assume utf-8 otherwise, as per the spec:
http://www.w3.org/TR/2000/REC-xml-20001006#NT-EncodingDecl
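
A minimal sketch of pulling out the declared encoding (the regex and the
function name are just illustrative; a real version would sniff BOMs and
UTF-16 declarations first):

    import re

    _decl = re.compile(r'<\?xml[^>]*?encoding=["\']([A-Za-z][-\w.]*)["\']')

    def xml_declared_encoding(data, default='utf-8'):
        # The declaration, if present, must be at the very start of the file.
        m = _decl.match(data)
        if m:
            return m.group(1)
        return default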

Does bash natively support utf-8?  Is there a bash equivalent to Python
coding: directives?  You may be attempting to fix a problem that doesn't
exist.


> Yeah, see, at a business level, I really need to process those all in 
> the same way, and it would be annoying to have to write code to handle 
> them all differently.

So you, or anyone else, can write a module for discovering the encoding
used for a particular file based on XML tags, Python coding: directives,
etc. It could include an extensible registry, and if it is used enough,
could be included in the Python standard library.


 - Josiah


From jcarlson at uci.edu  Thu Sep 14 18:46:06 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 09:46:06 -0700
Subject: [Python-3000] string C API
In-Reply-To: <8764fq8vo3.fsf@qrnik.zagroda>
References: <45092CC2.4070700@gmail.com> <8764fq8vo3.fsf@qrnik.zagroda>
Message-ID: <20060914093036.F957.JCARLSON@uci.edu>


"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> wrote:
> Nick Coghlan <ncoghlan at gmail.com> writes:
> 
> > Only the first such call on a given string, though - the idea
> > is to use lazy decoding, not to avoid decoding altogether.
> > Most manipulations (len, indexing, slicing, concatenation, etc)
> > would require decoding to at least UCS-2 (or perhaps UCS-4).
> 
> Silently optimizing string recoding might change the way recoding
> errors are reported. i.e. they might not be reported at all even
> if the string is malformed. Optimizations which change the semantics
> are bad.

This is not a problem.  During construction of the string, you would
either be recoding the original string to the standard 'compressed'
format, or if they had the same format, you would attempt a decoding,
and on failure, claim that the input wasn't in the encoding originally
specified.


Personally though, I'm not terribly inclined to believe that using a
'compressed' representation of utf-8 is desirable.  Why not use latin-1
when possible, ucs-2 when latin-1 isn't enough, and ucs-4 when ucs-2
isn't enough?  You get a fixed-width character encoding, and aside from
the (annoying) need to write variants of each string function for each
width (macros would help here), or generic versions of each, you never
need to recode the initial string after it has been created.

Even better, with a slightly modified buffer interface, these characters
can be exposed to C extensions in a somewhat transparent manner (if
desired).
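
To illustrate just the width-selection step, a rough sketch (the
function name is invented):

    def narrowest_width(codepoints):
        # Pick the smallest fixed-width storage able to hold every code point.
        widest = max(codepoints or [0])
        if widest <= 0xFF:
            return 1    # latin-1: one byte per character
        elif widest <= 0xFFFF:
            return 2    # ucs-2: two bytes per character
        return 4        # ucs-4: four bytes per character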


 - Josiah


From bob at redivi.com  Thu Sep 14 18:47:17 2006
From: bob at redivi.com (Bob Ippolito)
Date: Thu, 14 Sep 2006 09:47:17 -0700
Subject: [Python-3000] string C API
In-Reply-To: <20060914093036.F957.JCARLSON@uci.edu>
References: <45092CC2.4070700@gmail.com> <8764fq8vo3.fsf@qrnik.zagroda>
	<20060914093036.F957.JCARLSON@uci.edu>
Message-ID: <6a36e7290609140947s6261456bv4e0f40733f1c0e5f@mail.gmail.com>

On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> wrote:
> > Nick Coghlan <ncoghlan at gmail.com> writes:
> >
> > > Only the first such call on a given string, though - the idea
> > > is to use lazy decoding, not to avoid decoding altogether.
> > > Most manipulations (len, indexing, slicing, concatenation, etc)
> > > would require decoding to at least UCS-2 (or perhaps UCS-4).
> >
> > Silently optimizing string recoding might change the way recoding
> > errors are reported. i.e. they might not be reported at all even
> > if the string is malformed. Optimizations which change the semantics
> > are bad.
>
> This is not a problem.  During construction of the string, you would
> either be recoding the original string to the standard 'compressed'
> format, or if they had the same format, you would attempt a decoding,
> and on failure, claim that the input wasn't in the encoding originally
> specified.
>
>
> Personally though, I'm not terribly inclined to believe that using a
> 'compressed' representation of utf-8 is desirable.  Why not use latin-1
> when possible, ucs-2 when latin-1 isn't enough, and ucs-4 when ucs-2
> isn't enough?  You get a fixed-width character encoding, and aside from
> the (annoying) need to write variants of each string function for each
> width (macros would help here), or generic versions of each, you never
> need to recode the initial string after it has been created.
>
> Even better, with a slightly modified buffer interface, these characters
> can be exposed to C extensions in a somewhat transparent manner (if
> desired).

The argument for UTF-8 is probably interop efficiency. Lots of C
libraries, file formats, and wire protocols use UTF-8 for interchange.
Verifying the validity of UTF-8 during string creation isn't that big
of a deal.

-bob

From bwinton at latte.ca  Thu Sep 14 19:56:11 2006
From: bwinton at latte.ca (Blake Winton)
Date: Thu, 14 Sep 2006 13:56:11 -0400
Subject: [Python-3000] BOM handling
In-Reply-To: <20060914092020.F954.JCARLSON@uci.edu>
References: <45090D11.3060908@acm.org> <4509556F.4030508@latte.ca>
	<20060914092020.F954.JCARLSON@uci.edu>
Message-ID: <450997BB.6020703@latte.ca>

Josiah Carlson wrote:
> Blake Winton <bwinton at latte.ca> wrote:
>> I'm not going to 
>> suggest an API, other than it would be nice if I didn't have to manually 
>> figure out/hard code all the encodings.  (It's my belief that I will 
>> currently have to do that, or at least special-case XML, to read the 
>> encoding attribute.)
> Use the XML tag/attribute "<?xml ... encoding="..." ?> to discover the
> encoding and assume utf-8 otherwise as per spec:
> http://www.w3.org/TR/2000/REC-xml-20001006#NT-EncodingDecl

Yeah, but now you're requiring me to read and understand the file's 
contents, which is something I (as someone who doesn't particularly care 
about all this "encoding" stuff) am trying very hard not to do.  Does 
no-one write generic text processing programs anymore?

If I were to write a program which rotated an image using PIL, I 
wouldn't have to care whether it was a png or a jpeg.  (At least, I'm 
pretty sure I wouldn't.  I haven't tried recently.)

>> Oh, and it would be particularly horrible if I 
>> output a shell script in UTF-8, and it included the BOM, since I believe 
>> that would break the "magic number" of "#!".
> Does bash natively support utf-8?

A quick Google gives me:
-------------------------
About bash utf-8:
Bash is the shell, or command language interpreter, that will appear in 
the GNU operating system. It is default shell for BeOS.

By default, GNU bash assumes that every character is one byte long and 
one column wide. It may cause several problems for all non-english BeOS 
users, especially with file names using national characters. A patch for 
bash 2.04, by Marcin 'Qrczak' Kowalczyk and Ricardas Cepas, teaches bash 
about multibyte characters in UTF-8 encoding, and fixes those problems.
Double-width characters, combining characters and bidi are not supported 
by this patch.
-------------------------
which I'm mainly posting here because of the reference to Marcin 
'Qrczak' Kowalczyk.  Small world, but I wouldn't want to paint it.

 > Is there a bash equivalent to Python coding: directives?  You may be
 > attempting to fix a problem that doesn't exist.

I don't know if the magic number stuff to determine whether a file is 
executable or not is bash-specific.  Either way, when I save the file in 
UTF-8, it's fine, but when I save it in UTF-8 with a BOM, it fails.

>> Yeah, see, at a business level, I really need to process those all in 
>> the same way, and it would be annoying to have to write code to handle 
>> them all differently.
> So you, or anyone else, can write a module for discovering the encoding
> used for a particular file based on XML tags, Python coding: directives,
> etc. It could include an extensible registry, and if it is used enough,
> could be included in the Python standard library.

Okay, so what will happen for file types which aren't in the registry, 
like that Windows .rc files?

I was lying up above when I said that I don't care about this sort of 
thing.  I do care, but I also believe that I am, and should be, in the 
minority, and that if we can't ship something that will work for people 
who don't care about this stuff, then we've failed both them and Python.

Later,
Blake.

From paul at prescod.net  Thu Sep 14 20:12:10 2006
From: paul at prescod.net (Paul Prescod)
Date: Thu, 14 Sep 2006 11:12:10 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <20060914092020.F954.JCARLSON@uci.edu>
References: <45090D11.3060908@acm.org> <4509556F.4030508@latte.ca>
	<20060914092020.F954.JCARLSON@uci.edu>
Message-ID: <1cb725390609141112j6bc22220yd290d43e90c8501@mail.gmail.com>

As a somewhat aside: for XML encoding detection:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/363841

 Paul Prescod
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060914/99111ff6/attachment.htm 

From jcarlson at uci.edu  Thu Sep 14 20:58:47 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 11:58:47 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <450997BB.6020703@latte.ca>
References: <20060914092020.F954.JCARLSON@uci.edu> <450997BB.6020703@latte.ca>
Message-ID: <20060914112926.F95D.JCARLSON@uci.edu>


Blake Winton <bwinton at latte.ca> wrote:
> Josiah Carlson wrote:
> > Blake Winton <bwinton at latte.ca> wrote:
> >> I'm not going to 
> >> suggest an API, other than it would be nice if I didn't have to manually 
> >> figure out/hard code all the encodings.  (It's my belief that I will 
> >> currently have to do that, or at least special-case XML, to read the 
> >> encoding attribute.)
> > Use the XML tag/attribute "<?xml ... encoding="..." ?> to discover the
> > encoding and assume utf-8 otherwise as per spec:
> > http://www.w3.org/TR/2000/REC-xml-20001006#NT-EncodingDecl
> 
> Yeah, but now you're requiring me to read and understand the file's 
> contents, which is something I (as someone who doesn't particularly care 
> about all this "encoding" stuff) am trying very hard not to do.  Does 
> no-one write generic text processing programs anymore?

Not too long ago, "generic text processing programs" only had to deal
with one of ASCII, EBCDIC, etc., or were written specifically for text
encoded for a particular locale.  Times have changed, but the tools
really haven't.  If you want to easily deal with such things, write the
module.


> If I were to write a program which rotated an image using PIL, I 
> wouldn't have to care whether it was a png or a jpeg.  (At least, I'm 
> pretty sure I wouldn't.  I haven't tried recently.)

Right, but gif, png, jpeg, bmp, and scores of other multimedia formats
contain the equivalent of a Python coding: directive. Examine the first
dozen or so bytes of basically any kind of image, sound (not mp3s
though), or movie, and you will notice an ascii specifier for the type
of file.

By writing the registry module I described, one would be, in essence,
writing a library that understands what kind of media it has been handed,
at least as much as the equivalent of "this is a bmp" or "this is a gif".
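
(A minimal sketch of what I mean - every name here is illustrative,
not a proposed API:)

    _detectors = []

    def register(detector):
        # detector(data) -> encoding name, or None if it doesn't apply
        _detectors.append(detector)

    def guess_encoding(data, default='latin-1'):
        for detector in _detectors:
            enc = detector(data)
            if enc:
                return enc
        return default

    def _bom_detector(data):
        # the same trick the image formats use: look at leading bytes
        if data.startswith(b'\xef\xbb\xbf'):
            return 'utf-8'
        if data[:2] in (b'\xff\xfe', b'\xfe\xff'):
            return 'utf-16'
        return None

    register(_bom_detector)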

>  > Is there a bash equivalent to Python coding: directives?  You may be
>  > attempting to fix a problem that doesn't exist.
> 
> I don't know if the magic number stuff to determine whether a file is 
> executable or not is bash-specific.  Either way, when I save the file in 
> UTF-8, it's fine, but when I save it in UTF-8 with a BOM, it fails.

So don't save it with a BOM and add a Python coding: directive to the
second line.  Python and bash comments just happen to have the same #
delimiter, and if your editor doesn't suck, then it should understand
such a directive.  With luck, your editor should also allow for the
non-writing of the BOM on utf-8 save (given certain conditions).  If not,
contact the author(s) and request that feature.


> > So you, or anyone else, can write a module for discovering the encoding
> > used for a particular file based on XML tags, Python coding: directives,
> > etc. It could include an extensible registry, and if it is used enough,
> > could be included in the Python standard library.
> 
> Okay, so what will happen for file types which aren't in the registry, 
> like those Windows .rc files?

I'm not writing the encoding registry, but if I were, and if no known
encoding was found, I'd claim latin-1, if only because it 'succeeds'
when decoding character values 128-255.
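
(Concretely - and this is just a demonstration, not a proposal - latin-1
maps every byte 0-255 to the code point of the same value, so decoding
can never raise:)

    data = bytes(range(256))
    text = data.decode('latin-1')                  # never raises
    assert [ord(c) for c in text] == list(data)    # byte value == code point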

> I was lying up above when I said that I don't care about this sort of 
> thing.  I do care, but I also believe that I am, and should be, in the 
> minority, and that if we can't ship something that will work for people 
> who don't care about this stuff, then we've failed both them and Python.

Indeed, which is why people who do care should write a registry so that
their users don't need to care.

 - Josiah


From p.f.moore at gmail.com  Thu Sep 14 22:15:34 2006
From: p.f.moore at gmail.com (Paul Moore)
Date: Thu, 14 Sep 2006 21:15:34 +0100
Subject: [Python-3000] BOM handling
In-Reply-To: <20060914112926.F95D.JCARLSON@uci.edu>
References: <20060914092020.F954.JCARLSON@uci.edu> <450997BB.6020703@latte.ca>
	<20060914112926.F95D.JCARLSON@uci.edu>
Message-ID: <79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>

On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> So don't save it with a BOM and add a Python coding: directive to the
> second line.  Python and bash comments just happen to have the same #
> delimiter, and if your editor doesn't suck, then it should understand
> such a directive.

However, vim and emacs use *different* coding directive formats.
Python understands both, but (AFAIK) they don't understand each
other's. So which editor sucks? :-) :-) :-) (3 smileys is a
get-out-of-flamewar-free card :-))

I'm not trying to contradict you - just pointing out that the world
isn't as perfect as people here seem to want it to be.

Paul.

From jcarlson at uci.edu  Thu Sep 14 22:19:03 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 13:19:03 -0700
Subject: [Python-3000] string C API
In-Reply-To: <6a36e7290609140947s6261456bv4e0f40733f1c0e5f@mail.gmail.com>
References: <20060914093036.F957.JCARLSON@uci.edu>
	<6a36e7290609140947s6261456bv4e0f40733f1c0e5f@mail.gmail.com>
Message-ID: <20060914104921.F95A.JCARLSON@uci.edu>


"Bob Ippolito" <bob at redivi.com> wrote:
> The argument for UTF-8 is probably interop efficiency. Lots of C
> libraries, file formats, and wire protocols use UTF-8 for interchange.
> Verifying the validity of UTF-8 during string creation isn't that big
> of a deal.

Indeed, UTF-8 validation/creation isn't a big deal.  But that wasn't my
concern.  My concern was Python-only operation efficiency, for which a
fixed-length-per-character encoding generally wins (at least for
operations involving two strings with the same internal encoding).


 - Josiah


From bob at redivi.com  Thu Sep 14 22:34:38 2006
From: bob at redivi.com (Bob Ippolito)
Date: Thu, 14 Sep 2006 13:34:38 -0700
Subject: [Python-3000] string C API
In-Reply-To: <20060914104921.F95A.JCARLSON@uci.edu>
References: <20060914093036.F957.JCARLSON@uci.edu>
	<6a36e7290609140947s6261456bv4e0f40733f1c0e5f@mail.gmail.com>
	<20060914104921.F95A.JCARLSON@uci.edu>
Message-ID: <6a36e7290609141334x344cf42fpa561275c123c290b@mail.gmail.com>

On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Bob Ippolito" <bob at redivi.com> wrote:
> > The argument for UTF-8 is probably interop efficiency. Lots of C
> > libraries, file formats, and wire protocols use UTF-8 for interchange.
> > Verifying the validity of UTF-8 during string creation isn't that big
> > of a deal.
>
> Indeed, UTF-8 validation/creation isn't a big deal.  But that wasn't my
> concern.  My concern was Python-only operation efficiency, for which a
> fixed-length-per-character encoding generally wins (at least for
> operations involving two strings with the same internal encoding).

If you need to know the number of characters often you can calculate
that when the string's contents are validated. Slice ops may become
slower though... but versus UCS-4 the memory and memory bandwidth
savings might actually be a net performance win overall for many
applications.
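
(For instance - a sketch - the count can fall out of the same pass that
validates:)

    def utf8_char_count(data):
        # code points = bytes that are not continuation bytes (10xxxxxx)
        return sum(1 for byte in data if byte & 0xC0 != 0x80)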

-bob

From jason.orendorff at gmail.com  Thu Sep 14 22:53:58 2006
From: jason.orendorff at gmail.com (Jason Orendorff)
Date: Thu, 14 Sep 2006 16:53:58 -0400
Subject: [Python-3000] BOM handling
In-Reply-To: <1cb725390609141112j6bc22220yd290d43e90c8501@mail.gmail.com>
References: <45090D11.3060908@acm.org> <4509556F.4030508@latte.ca>
	<20060914092020.F954.JCARLSON@uci.edu>
	<1cb725390609141112j6bc22220yd290d43e90c8501@mail.gmail.com>
Message-ID: <bb8868b90609141353u3eb3846pb3f2726e41140705@mail.gmail.com>

For what it's worth:  in .NET, everything defaults to UTF-8, whether
reading or writing.  No BOM is generated when creating a new file.
  http://msdn2.microsoft.com/en-us/library/system.io.file.createtext.aspx

Java defaults to a "default character encoding", which on Windows is
the system's ANSI encoding.
  http://java.sun.com/j2se/1.4.2/docs/api/java/io/OutputStreamWriter.html

Neither correctly reads the other's output.  Pick your poison.

-j

From martin at v.loewis.de  Thu Sep 14 23:34:34 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 14 Sep 2006 23:34:34 +0200
Subject: [Python-3000] string C API
In-Reply-To: <45092CC2.4070700@gmail.com>
References: <ed968c$d03$1@sea.gmane.org>
	<45078B46.90408@v.loewis.de>		<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>		<450820A3.4000302@v.loewis.de>		<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>		<45083C76.8010302@v.loewis.de>	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
Message-ID: <4509CAEA.3040108@v.loewis.de>

Nick Coghlan schrieb:
> Only the first such call on a given string, though - the idea is to use
> lazy decoding, not to avoid decoding altogether. Most manipulations
> (len, indexing, slicing, concatenation, etc) would require decoding to
> at least UCS-2 (or perhaps UCS-4).

Ok. Then my objection is this: What about errors that occur in decoding?
What happens if the bytes are not meaningful in the presumed encoding?

ISTM that raising the exception lazily (which seems to be necessary)
would be very confusing.

Regards,
Martin

From 2006 at jmunch.dk  Fri Sep 15 01:05:28 2006
From: 2006 at jmunch.dk (Anders J. Munch)
Date: Fri, 15 Sep 2006 01:05:28 +0200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <20060913084256.F930.JCARLSON@uci.edu>
References: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net>
	<20060913084256.F930.JCARLSON@uci.edu>
Message-ID: <4509E038.2030808@jmunch.dk>

Josiah Carlson wrote:
 > Any sane person uses os.stat(f.name) or os.fstat(f.fileno()), unless
 > they want to seek to the end of the file for later writing or expected
 > reading of data yet-to-be-written.

os.fstat(f.fileno()).st_size doesn't work for file-like objects.
Goodbye unit testing with StringIOs.  f.seek(0,2);f.tell() is faster,
too.  I think the lunatics have a point.
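
(A sketch of the idiom as a helper - the name is mine, not iostack's:)

    def stream_size(f):
        # works for any seekable file-like object, StringIO included
        pos = f.tell()
        f.seek(0, 2)        # 2 == os.SEEK_END
        size = f.tell()
        f.seek(pos)         # restore the original position
        return size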

 > You were also talking about buffering writes to reduce the overhead of
 > the underlying seeks and tells because of apparent "optimizations" you
 > wanted to make. Here is a data integrity optimization you can make for
 > me: flush when accessing the file non-sequentially, any other behavior
 > could corrupt the data of users who have been relying on "seek implies
 > flush".

Again, that's what explicit calls to flush are for.  And you can't
violate expectations as to what the seek method does, when there's no
seek method and no concept of a file pointer.
Sprinkling extra flushes out here and there does not help data
integrity: only a flush that is part of a well-thought-out plan to
recover partially written data in case of a crash will help you do
that.  Anything less, and you're just a power failure and a disk that
reorders writes away from unrecoverable corruption.

My class consolidates writes, but doesn't reorder them.  That means
that to the extent that the system call for writing is transactional,
writes are not reordered.  I put the code up at
http://pastecode.com/4818.  As is, extending and truncating have bugs.

If you really want it, it's three lines changed to disable buffering
for non-sequential writes.  And an equivalent class completely without
buffering is pretty trivial.

 > With that said, I'm not sure your FileBytes object is really necessary
 > or desired for the future io library.  If people want that kind of an
 > interface, they can use mmap (and push for the various mmap bugs/feature
 > requests to be fixed), otherwise they should be using readable /
 > writable / both streams, something that Tomer has been working towards.

mmap has limitations that cannot be fixed.  It takes up virtual
memory, limiting the size of files you can work with.  You need to
specify the size in advance (note the potential race condition in
f=mmap.mmap(f.fileno(),os.fstat(f.fileno()).st_size)).  To what extent does it
work over networked file systems?  If you map a file on a file system
that is subsequently unmounted, a core dump may be the result.  All
this assuming the operating system supports mmap at all.

mmap is for use where speed is paramount, and pretty much only then.
The reason people don't use sequence-based file interfaces as much is
that robust, portable, practical sequence-based file interfaces aren't
available.  Probably most people who would have liked a sequence
interface do what I do: slurp up the whole file in one read and deal
with the string.  Or use mmap and live with the fragility.

- Anders


From murman at gmail.com  Fri Sep 15 01:30:09 2006
From: murman at gmail.com (Michael Urman)
Date: Thu, 14 Sep 2006 18:30:09 -0500
Subject: [Python-3000] BOM handling
In-Reply-To: <20060914112926.F95D.JCARLSON@uci.edu>
References: <20060914092020.F954.JCARLSON@uci.edu> <450997BB.6020703@latte.ca>
	<20060914112926.F95D.JCARLSON@uci.edu>
Message-ID: <dcbbbb410609141630m6aa946a8q55b6d339a5e71003@mail.gmail.com>

On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> With luck, your editor should also allow for the
> non-writing of the BOM on utf-8 save (given certain conditions).  If not,
> contact the author(s) and request that feature.

And hope they didn't write it in a language that doesn't let them
control when to use a BOM.

-- 
Michael Urman  http://www.tortall.net/mu/blog

From jcarlson at uci.edu  Fri Sep 15 02:02:01 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 17:02:01 -0700
Subject: [Python-3000] BOM handling
In-Reply-To: <79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>
References: <20060914112926.F95D.JCARLSON@uci.edu>
	<79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>
Message-ID: <20060914153932.F967.JCARLSON@uci.edu>


"Paul Moore" <p.f.moore at gmail.com> wrote:
> On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > So don't save it with a BOM and add a Python coding: directive to the
> > second line.  Python and bash comments just happen to have the same #
> > delimiter, and if your editor doesn't suck, then it should understand
> > such a directive.
> 
> However, vim and emacs use *different* coding directive formats.
> Python understands both, but (AFAIK) they don't understand each
> other's. So which editor sucks? :-) :-) :-) (3 smileys is a
> get-out-of-flamewar-free card :-))

Single users will be choosing a single tool.  Multiple users will likely
use a source repository.  Good source repositories will allow for pre or
post processing.  Or heck, I'm sure that Emacs or Vim can even be
tweaked to understand the other's encoding declarations.  If not, there
are more than a dozen source editors that support both, and even some
that offer the features I describe.


 - Josiah


From david.nospam.hopwood at blueyonder.co.uk  Fri Sep 15 02:00:19 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Fri, 15 Sep 2006 01:00:19 +0100
Subject: [Python-3000] BOM handling
In-Reply-To: <79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>
References: <20060914092020.F954.JCARLSON@uci.edu>
	<450997BB.6020703@latte.ca>	<20060914112926.F95D.JCARLSON@uci.edu>
	<79990c6b0609141315h716ef623y9a67b36c4ac61cd2@mail.gmail.com>
Message-ID: <4509ED13.9040409@blueyonder.co.uk>

Paul Moore wrote:
> On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> 
>>So don't save it with a BOM and add a Python coding: directive to the
>>second line.  Python and bash comments just happen to have the same #
>>delimiter, and if your editor doesn't suck, then it should understand
>>such a directive.
> 
> However, vim and emacs use *different* coding directive formats.
> Python understands both, but (AFAIK) they don't understand each
> other's. So which editor sucks?

Both, obviously. It would not have been beyond the wit of those editor
developers to talk to each other, or to just unilaterally support the
other editor's format as well as their own.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>




From greg.ewing at canterbury.ac.nz  Fri Sep 15 03:32:00 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 15 Sep 2006 13:32:00 +1200
Subject: [Python-3000] iostack, second revision
In-Reply-To: <4509E038.2030808@jmunch.dk>
References: <9B1795C95533CA46A83BA1EAD4B01030031F54@flonidanmail.flonidan.net>
	<20060913084256.F930.JCARLSON@uci.edu> <4509E038.2030808@jmunch.dk>
Message-ID: <450A0290.9080204@canterbury.ac.nz>

Anders J. Munch wrote:
> (note the potential race condition in
> f=mmap.mmap(f.fileno(),os.fstat(f.fileno()).st_size)).

Not sure anything could be done about that. Even if
there were an mmap-this-file-however-big-it-is call,
the size of the file could still change *after*
you'd mapped it.

--
Greg

From jcarlson at uci.edu  Fri Sep 15 05:01:39 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Thu, 14 Sep 2006 20:01:39 -0700
Subject: [Python-3000] iostack, second revision
In-Reply-To: <4509E038.2030808@jmunch.dk>
References: <20060913084256.F930.JCARLSON@uci.edu> <4509E038.2030808@jmunch.dk>
Message-ID: <20060914191243.F972.JCARLSON@uci.edu>


"Anders J. Munch" <2006 at jmunch.dk> wrote:
> Josiah Carlson wrote:
>  > You were also talking about buffering writes to reduce the overhead of
>  > the underlying seeks and tells because of apparent "optimizations" you
>  > wanted to make. Here is a data integrity optimization you can make for
>  > me: flush when accessing the file non-sequentially, any other behavior
>  > could corrupt the data of users who have been relying on "seek implies
>  > flush".
> 
> Again, that's what explicit calls to flush are for.  And you can't
> violate expectations as to what the seek method does, when there's no
> seek method and no concept of a file pointer.

People who have experience using Python 2.x file objects and/or
underlying platform file handles may have come to expect "seek implies
flush".  Since you claim that offering an unbuffered version is easy,
I'll pretend that such would be offered to the user as an option.

> Sprinkling extra flushes out here and there does not help data
> integrity: only a flush that is part of a well-thought-out plan to
> recover partially written data in case of a crash will help you do
> that.  Anything less, and you're just a power failure and a disk that
> reorders writes away from unrecoverable corruption.

Indeed, whether or not extra flushes help data integrity depends on the
file structure.  But for those who have the know-how to properly deal
with recovery of structured data files post power outage, not flushing
due to optimization is a larger sin than actively flushing - as data may
very well have a better chance to get to disk when you are flushing more
often.


>  > With that said, I'm not sure your FileBytes object is really necessary
>  > or desired for the future io library.  If people want that kind of an
>  > interface, they can use mmap (and push for the various mmap bugs/feature
>  > requests to be fixed), otherwise they should be using readable /
>  > writable / both streams, something that Tomer has been working towards.
> 
> mmap has limitations that cannot be fixed.  It takes up virtual
> memory, limiting the size of files you can work with.  You need to
> specify the size in advance (note the potential race condition in
> f=mmap.mmap(f.fileno(),os.fstat(f.fileno()).st_size)).  To what extent does it
> work over networked file systems?  If you map a file on a file system
> that is subsequently unmounted, a core dump may be the result.  All
> this assuming the operating system supports mmap at all.

Some of your concerns can be addressed with mmap plus a starting offset
and a length parameter of -1 (map through the end of the file).  This
results in being able to map arbitrary
portions of the file, as well as a Python-level race-free construction
of an mmap.  Then the FileBytes interface essentially becomes...

import mmap

class FileBytes(object):
    def __init__(self, fname, mode='r+b'):
        self.f = open(fname, mode)
    def __getitem__(self, key):
        start, stop = self._parseposition(key)
        # map just the requested region (real mmaps require the offset
        # to be aligned to mmap.ALLOCATIONGRANULARITY)
        return mmap.mmap(self.f.fileno(), stop - start, offset=start)
    def __setitem__(self, key, value):
        # write through a fresh mapping of the same region
        self[key][:] = value
    #_parseposition as you specify

With a non-broken platform mmap implementation, multiple identical calls
to __getitem__ will return identical data pointers, or at least the
underlying OS will make sure that the two pointers actually point to the
same physical memory region.


NFS issues are a pain.  This and the non-support of mmaps on smaller or
less developed platforms may be the only situations where not using
mmaps could offer superior failure conditions.


> mmap is for use where speed is paramount, and pretty much only then.
> The reason people don't use sequence-based file interfaces as much is
> that robust, portable, practical sequence-based file interfaces aren't
> available.  Probably most people who would have liked a sequence
> interface do what I do: slurp up the whole file in one read and deal
> with the string.  Or use mmap and live with the fragility.

I've found the opposite to be true.  Every time where I've wanted a
sequence-based file interface, I use an mmap: because it is faster and far
more reliable for all use-cases I've been confronted with (if your
process crashes, all of your writes are flushed).  But I suppose I spend
time with 512M and 1G mmaps, for which constant slicing of strings
and/or a file-based interface is about 100 times too slow (and useless
when a C extension wants to write to the file - mmaps do this for free).


 - Josiah


From ncoghlan at gmail.com  Fri Sep 15 15:29:58 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 15 Sep 2006 23:29:58 +1000
Subject: [Python-3000] string C API
In-Reply-To: <4509CAEA.3040108@v.loewis.de>
References: <ed968c$d03$1@sea.gmane.org>
	<45078B46.90408@v.loewis.de>		<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>		<450820A3.4000302@v.loewis.de>		<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>		<45083C76.8010302@v.loewis.de>	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de>
Message-ID: <450AAAD6.5030901@gmail.com>

Martin v. Löwis wrote:
> Nick Coghlan schrieb:
>> Only the first such call on a given string, though - the idea is to use
>> lazy decoding, not to avoid decoding altogether. Most manipulations
>> (len, indexing, slicing, concatenation, etc) would require decoding to
>> at least UCS-2 (or perhaps UCS-4).
> 
> Ok. Then my objection is this: What about errors that occur in decoding?
> What happens if the bytes are not meaningful in the presumed encoding?
> 
> ISTM that raising the exception lazily (which seems to be necessary)
> would be very confusing.

Yeah, it appears it would be necessary to at least *scan* the string when it 
was first created in order to ensure it can be decoded without errors later on.

I also realised there is another issue with an internal representation that 
can change over the life of a string, which is that of thread-safety.

Since strings don't currently have any mutable internal state, it's possible 
to freely share them between threads (without this property, the interning 
behaviour would be doomed).

If strings could change the encoding of their internal buffers then they'd 
have to use a read/write lock internally on all operations that might be 
affected when the internal representation changes. Blech.

Far, far simpler is the idea of supporting only latin-1, UCS-2 and UCS-4 as 
internal representations, and choosing which one to use when the string is 
created.
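
(A sketch of the creation-time choice - the function name is
illustrative:)

    def minimal_width(codepoints):
        # 1 -> latin-1, 2 -> UCS-2, 4 -> UCS-4
        widest = max(codepoints, default=0)
        if widest <= 0xFF:
            return 1
        if widest <= 0xFFFF:
            return 2
        return 4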

Sure certain applications that are just copying from one data stream to 
another (both in the same encoding) may needlessly decode and then re-encode 
the data, but if the application *knows* that this might happen (and has 
reason to care about optimising the performance of this case), then the 
application is free to decouple the "reading" and "decoding" steps, and just 
transfer raw bytes between the streams.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From jimjjewett at gmail.com  Fri Sep 15 16:25:08 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 15 Sep 2006 10:25:08 -0400
Subject: [Python-3000] string C API
In-Reply-To: <450AAAD6.5030901@gmail.com>
References: <ed968c$d03$1@sea.gmane.org>
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>
	<450820A3.4000302@v.loewis.de>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
Message-ID: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>

On 9/15/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Martin v. Löwis wrote:
> > Nick Coghlan schrieb:
> >> Only the first such call on a given string, though - the idea is to use
> >> lazy decoding, not to avoid decoding altogether. Most manipulations
> >> (len, indexing, slicing, concatenation, etc) would require decoding to
> >> at least UCS-2 (or perhaps UCS-4).

Or other workarounds.

> > Ok. Then my objection is this: What about errors that occur in decoding?
> > What happens if the bytes are not meaningful in the presumed encoding?

> > ISTM that raising the exception lazily (which seems to be necessary)
> > would be very confusing.

> Yeah, it appears it would be necessary to at least *scan* the string when it
> was first created in order to ensure it can be decoded without errors later on.

What happens today with strings?  I think the answer is:
     "Nothing.
      They print something odd when printed.
      They may raise errors when explicitly recoded to unicode."
Why is this a problem?

I see nothing wrong with an explicit .validate() method.

I see nothing wrong with a program choosing to recode everything into
a known encoding, which would validate as a side-effect.  This would
be the moral equivalent of today's unicode() call.

I'm not so happy about the efficiency implication of the idea that
*all* strings *must* be validated (let alone recoded).

> I also realised there is another issue with an internal representation that
> can change over the life of a string, which is that of thread-safety.

> Since strings don't currently have any mutable internal state, it's possible
> to freely share them between threads (without this property, the interning
> behaviour would be doomed).

Interning may get awkward if multiple encodings are allowed within a
program, regardless of whether they're allowed for single strings.  It
might make sense to intern only strings that are in the same encoding
as the source code.  (Or whose values are limited to ASCII?)

> If strings could change the encoding of their internal buffers then they'd
> have to use a read/write lock internally on all operations that might be
> affected when the internal representation changes. Blech.

Why?

There should be only one reference to a string until it is constructed,
and after that, its data should be immutable.  Recoding that results
in different bytes should not be in-place.  Either it returns a new
string (no problem) or it doesn't change the databuffer-and-encoding
pointer until the new databuffer is fully constructed.

Anything keeping its own reference to the old databuffer (and old
encoding) will continue to work, so immutability ==> the two buffers
really are equivalent.

> Sure certain applications that are just copying from one data stream to
> another (both in the same encoding) may needlessly decode and then re-encode
> the data,

Other than text editors, "certain" includes almost any application I
have ever used, let alone written.

> but if the application *knows* that this might happen (and has
> reason to care about optimising the performance of this case), then the
> application is free to decouple the "reading" and "decoding" steps, and just
> transfer raw bytes between the streams.

So adding boilerplate to treat text as bytes "for efficiency" may
become a standard recipe?  Not so good.

-jJ

From ncoghlan at gmail.com  Fri Sep 15 17:15:27 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 16 Sep 2006 01:15:27 +1000
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org>	
	<fb6fbf560609130634n2c07c29r8853b2e14068422c@mail.gmail.com>	
	<450820A3.4000302@v.loewis.de>	
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>	
	<45083C76.8010302@v.loewis.de>	
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>	
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>	
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
Message-ID: <450AC38F.4080005@gmail.com>

Jim Jewett wrote:
>> > ISTM that raising the exception lazily (which seems to be necessary)
>> > would be very confusing.
> 
>> Yeah, it appears it would be necessary to at least *scan* the string 
>> when it
>> was first created in order to ensure it can be decoded without errors 
>> later on.
> 
> What happens today with strings?  I think the answer is:
>     "Nothing.
>      They print something odd when printed.
>      They may raise errors when explicitly recoded to unicode."
> Why is this a problem?

We don't have 8-bit strings lying around in Py3k. To convert bytes to 
characters, they *must* be converted to unicode code points.

> I'm not so happy about the efficiency implication of the idea that
> *all* strings *must* be validated (let alone recoded).

Then always define latin-1 as the source encoding for your files - it will 
just pass the bytes straight through.
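
(That pass-through is lossless both ways, e.g.:)

    raw = bytes(range(256))
    assert raw.decode('latin-1').encode('latin-1') == raw   # round trip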

>> Since strings don't currently have any mutable internal state, it's 
>> possible
>> to freely share them between threads (without this property, the 
>> interning
>> behaviour would be doomed).
> 
> Interning may get awkward if multiple encodings are allowed within a
> program, regardless of whether they're allowed for single strings.  It
> might make sense to intern only strings that are in the same encoding
> as the source code.  (Or whose values are limited to ASCII?)

Unicode strings don't have an encoding - they only store code points.

>> If strings could change the encoding of their internal buffers then 
>> they'd
>> have to use a read/write lock internally on all operations that might be
>> affected when the internal representation changes. Blech.
> 
> Why?
> 
> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable.  Recoding that results
> in different bytes should not be in-place.  Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.
> 
> Anything keeping its own reference to the old databuffer (and old
> encoding) will continue to work, so immutability ==> the two buffers
> really are equivalent.

I admit that by using a separate Python object for the data buffer instead of 
a pointer to raw memory, the incref/decref in the processing code becomes the 
moral equivalent of a read lock, but consider the case where Thread A performs 
an operation and decides "I need to recode the buffer to UCS-4" at the same 
time that Thread B performs an operation and decides "I need to recode the 
buffer to UCS-4".

To deal with that you would still want to be very careful with the incref 
new/reassign/decref old step for switching in the new data buffer (probably
by using some form of atomic reassignment operation).

And this style has some very serious overhead implications, as each string 
would now require:
   The string object, with a 32 or 64 bit pointer to the data buffer object
   The data buffer object

String memory overhead would double, with an additional 32 or 64 bits 
depending on platform. This is a pretty significant increase when it comes to 
identifier-length strings.

So still blech, even if you make the data buffer a separate Python object to 
avoid the need for an actual read/write lock.

>> Sure certain applications that are just copying from one data stream to
>> another (both in the same encoding) may needlessly decode and then 
>> re-encode
>> the data,
> 
> Other than text editors, "certain" includes almost any application I
> have ever used, let alone written.

If you're reading text and you *know* it is ASCII data, then you can just set 
the encoding to latin-1 (since that can just copy the original bytes to the 
string's internal buffer - the actual ascii codec needs to check each byte to 
see whether or not the high bit is set, so it would be slower, and blow up
with a UnicodeDecodeError if the high bit was ever set).
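
(Spelled out, the per-byte check the ascii codec can't avoid - a
sketch, not the codec's actual code:)

    def ascii_ok(data):
        # ascii must reject any byte with the high bit set;
        # latin-1 can skip this scan entirely
        return all(byte < 0x80 for byte in data)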

I suspect an awful lot of quick-and-dirty scripts written by native English 
speakers will do exactly that.

>> but if the application *knows* that this might happen (and has
>> reason to care about optimising the performance of this case), then the
>> application is free to decouple the "reading" and "decoding" steps, 
>> and just
>> transfer raw bytes between the streams.
> 
> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe?  Not so good.

No, the standard recipe becomes "handle bytes as bytes and text as 
characters". If you know your source data is 8-bit text (or are happy to treat 
it that way, even if it isn't), then use the latin-1 codec to decode the 
original bytes directly to 8-bit characters.

Or just open the file in binary and read the data in as bytes instead of 
characters.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From jason.orendorff at gmail.com  Fri Sep 15 18:22:30 2006
From: jason.orendorff at gmail.com (Jason Orendorff)
Date: Fri, 15 Sep 2006 12:22:30 -0400
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org> <450820A3.4000302@v.loewis.de>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
Message-ID: <bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>

On 9/15/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable.  Recoding that results
> in different bytes should not be in-place.  Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.

Yes, but then having, say, a Latin-1 string, and repeatedly using it
in places where UTF-16 is needed, causes you to repeat the decoding
operation.  The optimization becomes a pessimization.

Here I'm imagining things like taking len(s) of a UTF-8 string, or
s==u where u happens to be UTF-16.  You only have to do this once or
twice per string to start losing.

Also, having two different classes of strings means fewer felicitous
cases of x==y, where the result is True, being just a pointer
comparison.  This might matter in dictionaries: imagine a dictionary
created as a literal and then used to look up key strings read from a
file.

> [Nick Coghlan wrote:]
> > [...] the
> > application is free to decouple the "reading" and "decoding" steps, and just
> > transfer raw bytes between the streams.
>
> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe?  Not so good.

I'm sure this will happen to the same degree that it's become a
standard recipe in Java and C# (both of which lack polymorphic
whatzits).  Which is to say, not at all.

-j

From paul at prescod.net  Fri Sep 15 18:33:49 2006
From: paul at prescod.net (Paul Prescod)
Date: Fri, 15 Sep 2006 09:33:49 -0700
Subject: [Python-3000] string C API
In-Reply-To: <bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
Message-ID: <1cb725390609150933q43b444f5ne788e9d222a5dcd1@mail.gmail.com>

On 9/15/06, Jason Orendorff <jason.orendorff at gmail.com> wrote:
>
> I'm sure this will happen to the same degree that it's become a
> standard recipe in Java and C# (both of which lack polymorphic
> whatzits).  Which is to say, not at all.


I think Jason's point is key. This is probably premature optimization and
should not be done if it will complicate the Python user's experience at all
(e.g. by delaying exceptions). Polymorphism is interesting to me primarily
to support 4-byte characters and therefore go beyond Java and C# in
functionality without slowing everything else down. If we gain some speed on
them for 8-bit strings, that would be a nice bonus.

But delaying UTF-8 decoding has not proven necessary for good performance in
the other Unicode-based languages. It just seems like extra complexity for
little benefit.

 Paul Prescod

From jimjjewett at gmail.com  Fri Sep 15 19:04:08 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 15 Sep 2006 13:04:08 -0400
Subject: [Python-3000] string C API
In-Reply-To: <450AC38F.4080005@gmail.com>
References: <ed968c$d03$1@sea.gmane.org>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<450AC38F.4080005@gmail.com>
Message-ID: <fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>

On 9/15/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Jim Jewett wrote:

> >> ... would be necessary to at least *scan* the string when it
> >> was first created in order to ensure it can be decoded without errors

> > What happens today with strings?  I think the answer is:
> >     "Nothing.
> >      They print something odd when printed.
> >      They may raise errors when explicitly recoded to unicode."
> > Why is this a problem?

> We don't have 8-bit strings lying around in Py3k.

Right.  But we do in Py 2.x, and the equivalent delayed errors have
not been a serious problem.  I suppose that might change if everyone
were actually using unicode, so that more stuff got converted
eventually.  On the other hand, I'm not sure how many strings will
*ever* need recoding, if we don't do it on construction.

> To convert bytes to
> characters, they *must* be converted to unicode code points.

A "code point" doesn't exist in actual code; it has to be represented
by some concrete encoding.  The most common encodings are UTF-8
and the various UTF-16 and UTF-32 forms, but they are still concrete
encodings, rather than the "real" code point.  A bytestream in latin-1
(with meta-knowledge that it is in latin-1) represents the abstract
code points just as much as a bytestream in UTF-8 would.  For some
purposes (including error detection) it is less efficient, but it is
just as valid.

> > I'm not so happy about the efficiency implication of the idea that
> > *all* strings *must* be validated (let alone recoded).

> Then always define latin-1 as the source encoding for your files - it will
> just pass the bytes straight through.

That would work for skipping validation.  It won't work if Python
insists on recoding everything to an internally privileged encoding.

> > Interning may get awkward if multiple encodings are allowed within a
> > program, regardless of whether they're allowed for single strings.  It
> > might make sense to intern only strings that are in the same encoding
> > as the source code.  (Or whose values are limited to ASCII?)

> Unicode strings don't have an encoding - they only store code points.

But these code points are stored somehow.  In Py 2.x, the decision was
to always use a specific privileged encoding, and to choose that
encoding at compile time.  This decision was not required by unicode;
it was chosen for implementation reasons.

> I admit that by using a separate Python object for the data buffer instead of
> a pointer to raw memory, the incref/decref in the processing code becomes the
> moral equivalent of a read lock, but consider the case where Thread A performs
> an operation and decides "I need to recode the buffer to UCS-4" at the same
> time that Thread B performs an operation and decides "I need to recode the
> buffer to UCS-4".

Then you end up doing it twice, and wasting even more space.   I
expect "never need to change the encoding" will be far more common
than
        (1)  Application is multithreaded
and     (2)  Multiple threads happen to be using the same string
and     (3)  Multiple threads need to recode it to the same new
encoding at the same time
and     (4)  This recoding need was in some way conditional, so the
programmer felt it was sensible to request it both places, instead of
just recoding once on creation.

> And this style has some very serious overhead implications, as each string
> would now require:
>    The string object, with a 32 or 64 bit pointer to the data buffer object
>    The data buffer object

> String memory overhead would double, with an additional 32 or 64 bits
> depending on platform. This is a pretty significant increase when it comes to
> identifier-length strings.

dicts already have to deal with this.  The workaround there was to
have a smalltable fastened to the dict, and to waste that smalltable
if the dictionary grows too large.  strings could do something
similar.  (Either all strings, keeping the original encoding, or just
small strings, so that not too much will ever be wasted.)

> >> Sure certain applications that are just copying from one data stream to
> >> another (both in the same encoding) may needlessly decode and then
> >> re-encode the data,

> > Other than text editors, "certain" includes almost any application I
> > have ever used, let alone written.

> If you're reading text and you *know* it is ASCII data, then you can just set
> the encoding to latin-1

Only if latin-1 is a valid encoding for the internal implementation.
If it is, then python does have to allow multiple internal
implementations, and some way of marking which was used.  (Obviously,
I think this is the right answer, but this is a change from 2.x, and
would require some changes to the C API.)

-jJ

From jcarlson at uci.edu  Fri Sep 15 19:46:52 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 15 Sep 2006 10:46:52 -0700
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
References: <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
Message-ID: <20060915102433.F980.JCARLSON@uci.edu>


"Jim Jewett" <jimjjewett at gmail.com> wrote:
> Interning may get awkward if multiple encodings are allowed within a
> program, regardless of whether they're allowed for single strings.  It
> might make sense to intern only strings that are in the same encoding
> as the source code.  (Or whose values are limited to ASCII?)

Why?  If the text hash function is defined on *code points*, then
interning, or really any arbitrary dictionary lookup is the same as it
has always been.


> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable.  Recoding that results
> in different bytes should not be in-place.  Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.

What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
method I previously described is that each of the encodings offers a
minimal representation of the code points that the text object contains. 
Certain operations would require a bit of work to handle the comparison
of code points stored in an x-bit-wide representation with code points
stored in a y-bit-wide representation.


> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe?  Not so good.

Presumably there is going to be a mechanism to open files as bytes
(reads return bytes), and for things like web servers, file servers, etc.,
serving the content up as just a bunch of bytes is really the only thing
that makes sense.

 - Josiah


From jcarlson at uci.edu  Fri Sep 15 19:48:06 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 15 Sep 2006 10:48:06 -0700
Subject: [Python-3000] string C API
In-Reply-To: <bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
Message-ID: <20060915102555.F983.JCARLSON@uci.edu>


"Jason Orendorff" <jason.orendorff at gmail.com> wrote:
> 
> On 9/15/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> > There should be only one reference to a string until it is constructed,
> > and after that, its data should be immutable.  Recoding that results
> > in different bytes should not be in-place.  Either it returns a new
> > string (no problem) or it doesn't change the databuffer-and-encoding
> > pointer until the new databuffer is fully constructed.
> 
> Yes, but then having, say, a Latin-1 string, and repeatedly using it
> in places where UTF-16 is needed, causes you to repeat the decoding
> operation.  The optimization becomes a pessimization.
> 
> Here I'm imagining things like taking len(s) of a UTF-8 string, or
> s==u where u happens to be UTF-16.  You only have to do this once or
> twice per string to start losing.

This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:

If I have a text object X whose internal representation is in UCS-2, and
I have a another text object Y whose internal representation is in UCS-4,
then I know X != Y.  Why?  Because X and Y were created with the minimal
width necessary to support the code points they contain. Because Y must
have a code point that X doesn't have, then X != Y.

When one wants to do things like Y.startswith(X), then you actually
compare the code points.
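
(A sketch of that shortcut; the .width and .codepoints attributes are
hypothetical:)

    def text_equal(x, y):
        if x.width != y.width:          # minimal widths differ =>
            return False                # contents must differ
        return x.codepoints == y.codepoints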


 - Josiah


From solipsis at pitrou.net  Fri Sep 15 20:04:33 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Fri, 15 Sep 2006 20:04:33 +0200
Subject: [Python-3000] string C API
In-Reply-To: <20060915102555.F983.JCARLSON@uci.edu>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
	<20060915102555.F983.JCARLSON@uci.edu>
Message-ID: <1158343473.4292.14.camel@fsol>

On Friday, 15 September 2006 at 10:48 -0700, Josiah Carlson wrote:
> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:

You could replace "latin-1" with "one-byte system encoding chosen at
interpreter startup depending on locale".
There are lots of 8-bit encodings other than iso-8859-1.
(for example, my current locale uses iso-8859-15)

The algorithm for choosing the one-byte encoding could be:
- if the current locale uses a one-byte encoding, use that encoding
- otherwise, if the current locale's language has a popular one-byte encoding
(for many languages this would mean iso-8859-<X>), use that encoding
- otherwise, no one-byte encoding

This would ensure that, for example, Russian text on a system configured
with a Russian locale does not always end up using two bytes per
character internally.
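
(A rough sketch of that selection rule; both tables are illustrative,
not exhaustive:)

    import locale

    ONE_BYTE = {'iso8859-1', 'iso8859-15', 'koi8-r', 'cp1251', 'cp1252'}
    POPULAR = {'ru': 'koi8-r', 'fr': 'iso8859-15', 'de': 'iso8859-15'}

    def choose_one_byte_encoding():
        lang, enc = locale.getdefaultlocale()
        if enc and enc.lower() in ONE_BYTE:
            return enc                        # locale is already one-byte
        if lang:
            return POPULAR.get(lang.split('_')[0])  # popular fallback
        return None                           # stick with a wide encoding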

Regards

Antoine.



From qrczak at knm.org.pl  Fri Sep 15 20:29:55 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 15 Sep 2006 20:29:55 +0200
Subject: [Python-3000] string C API
In-Reply-To: <1158343473.4292.14.camel@fsol> (Antoine Pitrou's message of
	"Fri, 15 Sep 2006 20:04:33 +0200")
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
	<20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol>
Message-ID: <871wqdhtjw.fsf@qrnik.zagroda>

Antoine Pitrou <solipsis at pitrou.net> writes:

>> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
>
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".

Latin-1 has the advantage of being trivially decodable to a sequence
of code points.

This is convenient for operations like string concatenation, or string
comparison, or taking substrings.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Fri Sep 15 20:36:21 2006
From: paul at prescod.net (Paul Prescod)
Date: Fri, 15 Sep 2006 11:36:21 -0700
Subject: [Python-3000] string C API
In-Reply-To: <1158343473.4292.14.camel@fsol>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
	<20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol>
Message-ID: <1cb725390609151136j14678530x97bd3dc30e1f6ca4@mail.gmail.com>

On 9/15/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
> On Friday, 15 September 2006 at 10:48 -0700, Josiah Carlson wrote:
> > This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
>
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".
> There are lots of 8-bit encodings other than iso-8859-1.
> (for example, my current locale uses iso-8859-15)
>
> The algorithm for choosing the one-byte encoding could be:
> - if the current locale uses a one-byte encoding, use that encoding
> - otherwise, if the current locale's language has a popular one-byte encoding
> (for many languages this would mean iso-8859-<X>), use that encoding
> - otherwise, no one-byte encoding
>
> This would ensure that, for example, Russian text on a system configured
> with a Russian locale does not always end up using two bytes per
> character internally.


I do not believe that this extra complexity will be valuable in the
long-term because most Europeans will switch to UTF-8 locales over the next
five years. The current situation makes no sense. Think about it from the
end-user's point of view:

"You can use KOI8-R/ISO-8859-? or UTF-8.

Pro for KOI8-R:

1. text files will use 0.8% instead of 1% of your hard disk space.
2. backwards compatibility

Pro for UTF-8:

1. Better compatibility with new software
2. Easier to share files across geographic boundaries
3. Ability to encode characters from other character sets
4. Access to characters like smart quotes, wingdings, fractions and so
forth.
"

The result seems obvious to me...8-bit-fixed encodings are a terrible idea
and need to just go away. Let's not build them into Python's core on the
basis of a minor and fleeting performance improvement.

 Paul Prescod

From jcarlson at uci.edu  Fri Sep 15 23:16:57 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 15 Sep 2006 14:16:57 -0700
Subject: [Python-3000] string C API
In-Reply-To: <1cb725390609151136j14678530x97bd3dc30e1f6ca4@mail.gmail.com>
References: <1158343473.4292.14.camel@fsol>
	<1cb725390609151136j14678530x97bd3dc30e1f6ca4@mail.gmail.com>
Message-ID: <20060915133827.F98C.JCARLSON@uci.edu>


"Paul Prescod" <paul at prescod.net> wrote:
[snip]
> The result seems obvious to me...8-bit-fixed encodings are a terrible idea
> and need to just go away. Let's not build them into Python's core on the
> basis of a minor and fleeting performance improvement.

Variable-width encodings make many operations difficult, not the least
of which being "what is the code point for the ith character?"  The
benefit of going with a fixed-width encoding (like Python currently does
for unicode objects with UCS-2) is that so many computations are merely
an iteration over a sequence of chars/shorts/ints.  No need to recode
for complicated operations, no need to understand utf-8 for string
operations, etc.
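
(The contrast in code - with a fixed-width array, s[i] is a single
index operation; with UTF-8 it is a scan:)

    def utf8_index(data, i):
        # byte offset of the i-th code point in UTF-8 bytes: O(n),
        # since characters occupy 1-4 bytes each
        seen = -1
        for pos, byte in enumerate(data):
            if byte & 0xC0 != 0x80:    # not a continuation byte
                seen += 1
                if seen == i:
                    return pos
        raise IndexError(i)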


 - Josiah


From jimjjewett at gmail.com  Fri Sep 15 23:37:41 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 15 Sep 2006 17:37:41 -0400
Subject: [Python-3000] string C API
In-Reply-To: <20060915102433.F980.JCARLSON@uci.edu>
References: <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<20060915102433.F980.JCARLSON@uci.edu>
Message-ID: <fb6fbf560609151437l714541a9t933e4ef865aaf12b@mail.gmail.com>

On 9/15/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Jim Jewett" <jimjjewett at gmail.com> wrote:
> > Interning may get awkward if multiple encodings are allowed within a
> > program, regardless of whether they're allowed for single strings.  It
> > might make sense to intern only strings that are in the same encoding
> > as the source code.  (Or whose values are limited to ASCII?)

> Why?  If the text hash function is defined on *code points*, then
> interning, or really any arbitrary dictionary lookup is the same as it
> has always been.

The problem isn't the hash; it is the equality.  Which encoding do you
keep interned?

> What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
> method I previously described is that each of the encodings offers a
> minimal representation of the code points that the text object contains.

There may be some thrashing as

    s+= (larger char)
    s[:6]

The three options might well be a sensible choice, but I think it
would already have much of the disadvantage of multiple internal
encodings, and we might eventually regret any specific limits.  (Why
not the local 8-bit?  Why not UTF-8, if that is the system encoding?)
It is easy enough to answer why not for each specific case, but I'm
not *certain* that it is the right answer -- so why not leave it up to
implementors if they want to do more than the basic three?

> Presumably there is going to be a mechanism to open files as bytes
> (reads return bytes), and for things like web servers, file servers, etc.,
> serving the content up as just a bunch of bytes is really the only thing
> that makes sense.

If someone has to recognize that their document is "text" when they
edit it, but "bytes" when they serve it over the web, and then "text"
again when they view it in the browser ... that is a recipe for
misunderstandings.

-jJ

From jcarlson at uci.edu  Sat Sep 16 02:13:33 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Fri, 15 Sep 2006 17:13:33 -0700
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609151437l714541a9t933e4ef865aaf12b@mail.gmail.com>
References: <20060915102433.F980.JCARLSON@uci.edu>
	<fb6fbf560609151437l714541a9t933e4ef865aaf12b@mail.gmail.com>
Message-ID: <20060915153702.F98F.JCARLSON@uci.edu>


"Jim Jewett" <jimjjewett at gmail.com> wrote:
> On 9/15/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > "Jim Jewett" <jimjjewett at gmail.com> wrote:
> > > Interning may get awkward if multiple encodings are allowed within a
> > > program, regardless of whether they're allowed for single strings.  It
> > > might make sense to intern only strings that are in the same encoding
> > > as the source code.  (Or whose values are limited to ASCII?)
> 
> > Why?  If the text hash function is defined on *code points*, then
> > interning, or really any arbitrary dictionary lookup is the same as it
> > has always been.
> 
> The problem isn't the hash; it is the equality.  Which encoding do you
> keep interned?

There is one minimal 'encoding' for any unicode string (in one of
latin-1, ucs-2, or ucs-4), really being an array of minimal-width
char/short/int code points. Because all text objects are internally
represented in their minimal 'encoding', equal text objects will always be
in the same encoding.


> > What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
> > method I previously described is that each of the encodings offers a
> > minimal representation of the code points that the text object contains.
> 
> There may be some thrashing as
> 
>     s+= (larger char)
>     s[:6]

So there may be thrashing.  I don't see this as a problem.  String
addition and slicing are known to be linear in the length of the string being
produced for all nontrivial cases.  It's still linear.  What's the
problem?


> The three options might well be a sensible choice, but I think it
> would already have much of the disadvantage of multiple internal
> encodings, and we might eventually regret any specific limits.  (Why
> not the local 8-bit?  Why not UTF-8, if that is the system encoding?)
> It is easy enough to answer why not for each specific case, but I'm
> not *certain* that it is the right answer -- so why not leave it up to
> implementors if they want to do more than the basic three?

By "basic three" I presume you mean latin-1, ucs-2, and ucs-4.  I'm not
advocating for anything beyond those, in fact, I'm specifically
discouraging using anything other than those three, and I'm specifically
discouraging the idea of recoding internal representations.  Once a
text object is created, its internal state is fixed until it is
destroyed.


> > Presumably there is going to be a mechanism to open files as bytes
> > (reads return bytes), and for things like web servers, file servers, etc.,
> > serving the content up as just a bunch of bytes is really the only thing
> > that makes sense.
> 
> If someone has to recognize that their document is "text" when they
> edit it, but "bytes" when they serve it over the web, and then "text"
> again when they view it in the browser ... that is a recipe for
> misunderstandings.

They don't need to recognize anything when it is served onto the web. 
Just like they don't need to recognize anything right now.  The file is
served verbatim off of disk, which is then understood by the browser
because of encoding information built into the format.  If the format
doesn't have encoding information built into it, then the user isn't
going to be able to edit it.

 - Josiah


From and at doxdesk.com  Sat Sep 16 03:08:06 2006
From: and at doxdesk.com (Andrew Clover)
Date: Sat, 16 Sep 2006 02:08:06 +0100
Subject: [Python-3000] UTF-16
In-Reply-To: <1cb725390608312124u24d20ec2q27dbe5a69c2440d3@mail.gmail.com>
References: <1cb725390608312032t388c250by13befed154b4442d@mail.gmail.com>	<ca471dc20608312046m120f23adx494d50723289029@mail.gmail.com>
	<1cb725390608312124u24d20ec2q27dbe5a69c2440d3@mail.gmail.com>
Message-ID: <450B4E76.1010501@doxdesk.com>

On 2006-09-01, Paul Prescod wrote:

> I cannot understand why a user should be forced to choose between 16 and 32
> bit strings AT BUILD TIME.

I strongly agree. This has been troublesome for many: not just for
people trying to install binary libs, but also for Python code that
actually needs to know the difference between unicode and wide-unicode
characters.

Ideally, implementation work notwithstanding, I would *love* to be able 
to have both types at a literal level (as unicode subclasses), along 
with retained byte string literals.

     ucs2string= u'\U00010000'  # 2 chars, \ud800\udc00
     ucs4string= w'\U00010000'  # 1 char
     bytestring= b'abc'
     string= 'abc'              # byte in 2.x, ucs2 in 3.0

If these were all subclasses of basestring, and other string type 
subclasses could be defined taking advantage of basic string methods, 
that could also allow the CSI stuff from Matz that you posted a mention of.
Although I'm personally not at all a fan of non-Unicode string types and 
would rather die than put i-mode emoji in a character set :-)

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

From greg.ewing at canterbury.ac.nz  Sat Sep 16 03:07:06 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 16 Sep 2006 13:07:06 +1200
Subject: [Python-3000] string C API
In-Reply-To: <20060915153702.F98F.JCARLSON@uci.edu>
References: <20060915102433.F980.JCARLSON@uci.edu>
	<fb6fbf560609151437l714541a9t933e4ef865aaf12b@mail.gmail.com>
	<20060915153702.F98F.JCARLSON@uci.edu>
Message-ID: <450B4E3A.7000005@canterbury.ac.nz>

Josiah Carlson wrote:
> Because all text objects are internally
> represented in their minimal 'encoding', equal text objects will always be
> in the same encoding.

That places a burden on all creators of strings to ensure
that they are in the minimal format, which could be
inconvenient for some operations, e.g. taking a substring
could require making an extra pass to re-code the data.
It would also preclude the possibility of representing
a substring as a view.

I don't see any great advantage given by this restriction
anyway. So you could tell two strings were unequal in
some cases if they happened to have different storage
formats, but there would still be plenty of cases
where you did have to compare them. Doesn't look like
a big deal to me.

--
Greg

From ncoghlan at gmail.com  Sat Sep 16 05:14:49 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 16 Sep 2006 13:14:49 +1000
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org>	
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>	
	<45083C76.8010302@v.loewis.de>	
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>	
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>	
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>	
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	
	<450AC38F.4080005@gmail.com>
	<fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>
Message-ID: <450B6C29.8060107@gmail.com>

Jim Jewett wrote:
> On 9/15/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> If you're reading text and you *know* it is ASCII data, then you can 
>> just set
>> the encoding to latin-1
> 
> Only if latin-1 is a valid encoding for the internal implementation.

I think the possible internal encodings should be latin-1, UCS-2 and UCS-4, 
with the size for a given string dictated by the largest codepoint in the 
string at creation time.

That way the internal representation of a string would only need to grow one 
extra field (the one saying how many bytes there are per character), and the 
internal state would remain immutable.

For 8-bit source data, 'latin-1' would then be the most efficient encoding, in 
that it would be a simple memcpy from the bytes object's internal buffer to 
the string object's internal buffer. Other encodings like 'koi8-r' would be 
decoded to either latin-1, UCS-2 or UCS-4 depending on the largest code point 
in the source data.

[Jim]
> If it is, then python does have to allow multiple internal
> implementations, and some way of marking which was used.  (Obviously,
> I think this is the right answer, but this is a change from 2.x, and
> would require some changes to the C API.)

One of the paragraphs you cut when replying to my message:

[Nick]
>> Far, far simpler is the idea of supporting only latin-1, UCS-2 and UCS-4 as 
>> internal representations, and choosing which one to use when the string is 
>> created.

I think we might be violently agreeing :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Sat Sep 16 05:46:36 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 16 Sep 2006 13:46:36 +1000
Subject: [Python-3000] string C API
In-Reply-To: <1158343473.4292.14.camel@fsol>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>	<20060915102555.F983.JCARLSON@uci.edu>
	<1158343473.4292.14.camel@fsol>
Message-ID: <450B739C.20607@gmail.com>

Antoine Pitrou wrote:
On Friday, 15 September 2006 at 10:48 -0700, Josiah Carlson wrote:
>> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
> 
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".
> There are lots of 8-bit encodings other than iso-8859-1.
> (for example, my current locale uses iso-8859-15)

The choice of latin-1 is deliberate and non-arbitrary. The reason for the 
choice is that the ordinals 0-255 in latin-1 map to the Unicode code points 0-255:

 >>> x = range(256)
 >>> xs = ''.join(map(chr, x))
 >>> xu = xs.decode('latin-1')
 >>> all(ord(s)==ord(u) for s, u in zip(xs, xu))
True

In effect, when creating the string, you would be doing something like this:

   if encoding == 'latin-1':
       bytes_per_char = 1
       code_points = eight_bit_data
   else:
       code_points, max_code_point = decode_to_UCS4(eight_bit_data, encoding)
       if max_code_point < 256:
           bytes_per_char = 1
       elif max_code_point < 65536:
           bytes_per_char = 2
       else:
           bytes_per_char = 4
   # A width argument to the bytes constructor would be very convenient
   # for being able to consistently deal with endianness issues
   self.internal_buffer = bytes(code_points, width=bytes_per_char)
   self.bytes_per_char = bytes_per_char

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ronaldoussoren at mac.com  Sat Sep 16 07:59:45 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Sat, 16 Sep 2006 07:59:45 +0200
Subject: [Python-3000] string C API
In-Reply-To: <fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>
References: <ed968c$d03$1@sea.gmane.org>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<450AC38F.4080005@gmail.com>
	<fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>
Message-ID: <58B2A910-3360-4480-A8CF-2D4C95F56981@mac.com>


On Sep 15, 2006, at 7:04 PM, Jim Jewett wrote:

> On 9/15/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> Jim Jewett wrote:
>
>>>> ... would be necessary to at least *scan* the string when it
>>>> was first created in order to ensure it can be decoded without  
>>>> errors
>
>>> What happens today with strings?  I think the answer is:
>>>     "Nothing.
>>>      They print something odd when printed.
>>>      They may raise errors when explicitly recoded to unicde."
>>> Why is this a problem?
>
>> We don't have 8-bit strings lying around in Py3k.
>
> Right.  But we do in Py 2.x, and the equivalent delayed errors have
> not been a serious problem.  I suppose that might change if everyone
> were actually using unicode, so that more stuff got converted
> eventually.  On the other hand, I'm not sure how many strings will
> *ever* need recoding, if we don't do it on construction.

Automatic conversion from str to unicode in Py2.x is annoying at
times, mostly because it is easy to miss at development time. Using
unicode throughout (explicit conversion to unicode at the application
boundary) solves that, but that problem would reappear if
unicode(somestr, someencoding) returned a value that might cause a
UnicodeError when you try to access its value.

Another reason for disliking your idea is that unicode/py3k-str is a  
sequence of unicode code points and should always behave like one to  
the user. A polymorphic string type is an optimization (and an  
unproven one at that) and shouldn't complicate the Python-level  
string API.

Ronald

From martin at v.loewis.de  Sat Sep 16 08:32:37 2006
From: martin at v.loewis.de (Martin v. Löwis)
Date: Sat, 16 Sep 2006 08:32:37 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450B6C29.8060107@gmail.com>
References: <ed968c$d03$1@sea.gmane.org>	
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>	
	<45083C76.8010302@v.loewis.de>	
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>	
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>	
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>	
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	
	<450AC38F.4080005@gmail.com>
	<fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>
	<450B6C29.8060107@gmail.com>
Message-ID: <450B9A85.6040904@v.loewis.de>

Nick Coghlan wrote:
> That way the internal representation of a string would only need to grow
> one extra field (the one saying how many bytes there are per character),
> and the internal state would remain immutable.

You could play tricks with ob_size to save this field:

- ob_size < 0: 8-bit data; length is abs(ob_size)
- ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2
- ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2

The first representation constrains the length of an 8-bit
representation to max_ssize_t, which is also the limit today.
For 16-bit strings, the limit is max_ssize_t/2, which means
max_ssize_t bytes; this is technically more constraining, but
such a string would still consume half of the address space,
and is unlikely to get created (*). For 32-bit strings, the
limit is also max_ssize_t/2, yet the maximum string would
require more than 2*max_ssize_t (==max_size_t) bytes, so
this isn't a real limitation.
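
In Python terms, decoding such an ob_size would look something like
this (a sketch of the trick, not actual CPython code):

    def decode_ob_size(ob_size):
        # negative        -> 8-bit data,  length abs(ob_size)
        # positive, even  -> 16-bit data, length ob_size/2
        # positive, odd   -> 32-bit data, length ob_size/2
        if ob_size < 0:
            return 1, -ob_size
        elif ob_size & 1 == 0:
            return 2, ob_size // 2
        else:
            return 4, ob_size // 2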

> For 8-bit source data, 'latin-1' would then be the most efficient
> encoding, in that it would be a simple memcpy from the bytes object's
> internal buffer to the string object's internal buffer. Other encodings
> like 'koi8-r' would be decoded to either latin-1, UCS-2 or UCS-4
> depending on the largest code point in the source data.

This might somewhat slow down codecs, which would have to scan the input
string first to find out what the maximum code point is, where they
currently can decode in a single pass. Of course, for multi-byte codecs,
such scanning is a good idea, anyway (some currently overallocate just
to avoid the second pass).

Regards,
Martin

(*) Many systems don't allow such large memory blocks, anyway.
E.g. on 32-bit Windows, in the standard configuration, the
address space is "only" 2GB.

From jcarlson at uci.edu  Sat Sep 16 10:22:43 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 16 Sep 2006 01:22:43 -0700
Subject: [Python-3000] string C API
In-Reply-To: <450B4E3A.7000005@canterbury.ac.nz>
References: <20060915153702.F98F.JCARLSON@uci.edu>
	<450B4E3A.7000005@canterbury.ac.nz>
Message-ID: <20060915183617.F995.JCARLSON@uci.edu>


Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> 
> Josiah Carlson wrote:
> > Because all text objects are internally
> > represented in their minimal 'encoding', equal text objects will always be
> > in the same encoding.
> 
> That places a burden on all creators of strings to ensure
> that they are in the minimal format, which could be
> inconvenient for some operations, e.g. taking a substring
> could require making an extra pass to re-code the data.

If Martin says it's not a big deal, I'm not really all that concerned.


> It would also preclude the possibility of representing
> a substring as a view.

It doesn't preclude views.  Every operation works as before, only now
one would need to compare contents even on unequal-width code points.


> I don't see any great advantage given by this restriction
> anyway. So you could tell two strings were unequal in
> some cases if they happened to have different storage
> formats, but there would still be plenty of cases
> where you did have to compare them. Doesn't look like
> a big deal to me.

It is ultimately about space savings, and in the case of names (since
all will be 8-bit), perhaps even a bit faster to look up in the
interning table (I believe it is easier to hash 8 chars than 8 shorts).

 - Josiah


From qrczak at knm.org.pl  Sat Sep 16 11:53:51 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 16 Sep 2006 11:53:51 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450B4E3A.7000005@canterbury.ac.nz> (Greg Ewing's message of
	"Sat, 16 Sep 2006 13:07:06 +1200")
References: <20060915102433.F980.JCARLSON@uci.edu>
	<fb6fbf560609151437l714541a9t933e4ef865aaf12b@mail.gmail.com>
	<20060915153702.F98F.JCARLSON@uci.edu>
	<450B4E3A.7000005@canterbury.ac.nz>
Message-ID: <874pv8i1cg.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

> That places a burden on all creators of strings to ensure
> that they are in the minimal format, which could be
> inconvenient for some operations, e.g. taking a substring
> could require making an extra pass to re-code the data.

Yes, but taking a substring already requires linear time wrt. the
length of the substring.

Allocating a string from a C array of wide characters (which
determines the format from the contents) will be written once and
called as a function.

Most strings are ASCII, so most of the time there is no need to check
whether the substring could become even narrower.

> It would also preclude the possibility of representing
> a substring as a view.

If views were implemented on the level of C pointers, then views would
not have the property of being in the canonical representation wrt.
character width. It's still valuable I think to use a more compact
representation if it would affect most strings.

> I don't see any great advantage given by this restriction
> anyway.

Keeping the canonical representation is not very important. It just
ensures that the advantage of having a more compact representation is
taken as often as possible, even if the string has been cut from
another string which contained a wide character.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Sat Sep 16 12:02:37 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 16 Sep 2006 12:02:37 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450B9A85.6040904@v.loewis.de> (Martin v.
	Löwis's message of "Sat, 16 Sep 2006 08:32:37 +0200")
References: <ed968c$d03$1@sea.gmane.org>
	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>
	<45083C76.8010302@v.loewis.de>
	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>
	<45084966.3000608@v.loewis.de> <45092CC2.4070700@gmail.com>
	<4509CAEA.3040108@v.loewis.de> <450AAAD6.5030901@gmail.com>
	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<450AC38F.4080005@gmail.com>
	<fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>
	<450B6C29.8060107@gmail.com> <450B9A85.6040904@v.loewis.de>
Message-ID: <87zmd0gmde.fsf@qrnik.zagroda>

"Martin v. L?wis" <martin at v.loewis.de> writes:

> You could play tricks with ob_size to save this field:
>
> - ob_size < 0: 8-bit data; length is abs(ob_size)
> - ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2
> - ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2

I wonder whether strings with characters outside ISO-8859-1 are common
enough that having a 16-bit representation is worth the trouble.

CLISP does have it. My language doesn't.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From martin at v.loewis.de  Sat Sep 16 15:43:36 2006
From: martin at v.loewis.de (Martin v. Löwis)
Date: Sat, 16 Sep 2006 15:43:36 +0200
Subject: [Python-3000] string C API
In-Reply-To: <20060915183617.F995.JCARLSON@uci.edu>
References: <20060915153702.F98F.JCARLSON@uci.edu>	<450B4E3A.7000005@canterbury.ac.nz>
	<20060915183617.F995.JCARLSON@uci.edu>
Message-ID: <450BFF88.3060902@v.loewis.de>

Josiah Carlson wrote:
>> That places a burden on all creators of strings to ensure
>> that they are in the minimal format, which could be
>> inconvenient for some operations, e.g. taking a substring
>> could require making an extra pass to re-code the data.
> 
> If Martin says it's not a big deal, I'm not really all that concerned.

I was thinking about codecs specifically: they often need to make
multiple passes anyway.

In general, only measurements can tell the performance impacts of
some design decision (e.g. it's non-obvious how often the various
string operations occur, and what the performance impact is).

There is also an issue of convenience here; however, with three
different representations, library functions could be provided
to support all cases.

> It is ultimately about space savings, and in the case of names (since
> all will be 8-bit), perhaps even a bit faster to look up in the
> interning table (I believe it is easier to hash 8 chars than 8 shorts).

That you need to demonstrate through profiling. First, strings likely
continue to keep their hash, and then it seems plausible that the cost
for hashing is in the computation and the loop, not in the memory
access, and that the computation is carried out in 32-bit registers
regardless of character width.

Regards,
Martin

From martin at v.loewis.de  Sat Sep 16 15:49:29 2006
From: martin at v.loewis.de (Martin v. Löwis)
Date: Sat, 16 Sep 2006 15:49:29 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450B739C.20607@gmail.com>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>	<20060915102555.F983.JCARLSON@uci.edu>	<1158343473.4292.14.camel@fsol>
	<450B739C.20607@gmail.com>
Message-ID: <450C00E9.6070008@v.loewis.de>

Nick Coghlan wrote:
> The choice of latin-1 is deliberate and non-arbitrary. The reason for the 
> choice is that the ordinals 0-255 in latin-1 map to the Unicode code points 0-255:

That's true, but it doesn't follow that this makes it a good choice for
a special case. Instead, it is the frequency of occurrence of the
special case that makes it a good choice.

> In effect, when creating the string, you would be doing something like this:
> 
>    if encoding == 'latin-1':
>        bytes_per_char = 1
>        code_points = 8_bit_data
>    else:
>        code_points, max_code_point = decode_to_UCS4(8_bit_data, encoding)
>        if max_code_point < 256:
>            bytes_per_char = 1
>        elif max_code_point < 65536:
>            bytes_per_char = 2
>        else:
>            bytes_per_char = 4

Hardly. Instead, the codec would have to create the string of the right
width; a codec written in C would make two passes, rather than
temporarily allocating memory to actually represent the UCS-4 codes.

Regards,
Martin

From martin at v.loewis.de  Sat Sep 16 15:55:47 2006
From: martin at v.loewis.de (Martin v. Löwis)
Date: Sat, 16 Sep 2006 15:55:47 +0200
Subject: [Python-3000] string C API
In-Reply-To: <87zmd0gmde.fsf@qrnik.zagroda>
References: <ed968c$d03$1@sea.gmane.org>	<fb6fbf560609131009saa1fd17laca602a5e0fceba0@mail.gmail.com>	<45083C76.8010302@v.loewis.de>	<fb6fbf560609131027i531869d3hae25e33b2562f86b@mail.gmail.com>	<45084966.3000608@v.loewis.de>
	<45092CC2.4070700@gmail.com>	<4509CAEA.3040108@v.loewis.de>
	<450AAAD6.5030901@gmail.com>	<fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	<450AC38F.4080005@gmail.com>	<fb6fbf560609151004g5f285483x5733f3774e9622f1@mail.gmail.com>	<450B6C29.8060107@gmail.com>
	<450B9A85.6040904@v.loewis.de> <87zmd0gmde.fsf@qrnik.zagroda>
Message-ID: <450C0263.3090506@v.loewis.de>

Marcin 'Qrczak' Kowalczyk wrote:
>> You could play tricks with ob_size to save this field:
>>
>> - ob_size < 0: 8-bit data; length is abs(ob_size)
>> - ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2
>> - ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2
> 
> I wonder whether strings with characters outside ISO-8859-1 are common
> enough that having a 16-bit representation is worth the trouble.
> 
> CLISP does have it. My language doesn't.

The design of Unicode is such that all "living" scripts are encoded
within the BMP. So four-byte characters would be extremely rare, and one
may argue that encoding them with UTF-16 is good enough.

So if there is flexibility in the internal representation of strings,
I think a two-byte representation should definitely be one of the
options; I'd rather debate about the necessity of one-byte and
four-byte representations.

Regards,
Martin


From ncoghlan at gmail.com  Sat Sep 16 18:49:36 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 17 Sep 2006 02:49:36 +1000
Subject: [Python-3000] string C API
In-Reply-To: <450C00E9.6070008@v.loewis.de>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>	<20060915102555.F983.JCARLSON@uci.edu>	<1158343473.4292.14.camel@fsol>
	<450B739C.20607@gmail.com> <450C00E9.6070008@v.loewis.de>
Message-ID: <450C2B20.4090504@gmail.com>

Martin v. Löwis wrote:
> Nick Coghlan wrote:
>> The choice of latin-1 is deliberate and non-arbitrary. The reason for the 
>> choice is that the ordinals 0-255 in latin-1 map to the Unicode code points 0-255:
> 
> That's true, but that this makes a good choice for a special case
> doesn't follow. Instead, frequency of occurrence of the special case
> makes it a good choice.

If an 8-bit encoding other than latin-1 is used for the internal buffer, then 
every comparison operation would have to decode the string to Unicode in order 
to compare code points.

It seems much simpler to me to ensure that what is stored internally is 
*always* the Unicode code points, with the width (1, 2 or 4 bytes) determined 
by the largest code point in the string. The latter two are the UCS-2 and 
UCS-4 formats that are compile-time selectable for unicode strings in Python 
2.x, but I'm not aware of any name other than 'latin-1' for the case where all 
of the code points are less than 256.

> Hardly. Instead, the codec would have to create the string of the right
> width; a codec written in C would make two passes, rather than
> temporarily allocating memory to actually represent the UCS-4 codes.

Indeed, that does make more sense - one pass to figure out the number of 
characters and the largest code point, and a second to copy the characters to 
the allocated buffer.
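
A rough sketch of that two-pass decode in Python terms (illustrative
only; decode_one is a hypothetical helper returning one code point and
the next input position):

    def scan_pass(raw, decode_one):
        # Pass 1: count the characters and find the largest code point.
        count, widest, pos = 0, 0, 0
        while pos < len(raw):
            cp, pos = decode_one(raw, pos)
            count += 1
            widest = max(widest, cp)
        if widest < 0x100:
            width = 1
        elif widest < 0x10000:
            width = 2
        else:
            width = 4
        # Pass 2 would allocate count * width bytes and decode again,
        # storing each code point into the new buffer.
        return count, width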

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From martin at v.loewis.de  Sat Sep 16 20:01:28 2006
From: martin at v.loewis.de (Martin v. Löwis)
Date: Sat, 16 Sep 2006 20:01:28 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450C2B20.4090504@gmail.com>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>	<20060915102555.F983.JCARLSON@uci.edu>	<1158343473.4292.14.camel@fsol>
	<450B739C.20607@gmail.com> <450C00E9.6070008@v.loewis.de>
	<450C2B20.4090504@gmail.com>
Message-ID: <450C3BF8.2030901@v.loewis.de>

Nick Coghlan wrote:
> If an 8-bit encoding other than latin-1 is used for the internal buffer,
> then every comparison operation would have to decode the string to
> Unicode in order to compare code points.
> 
> It seems much simpler to me to ensure that what is stored internally is
> *always* the Unicode code points, with the width (1, 2 or 4 bytes)
> determined by the largest code point in the string.

Just try implementing comparison some time. You can end up implementing
the same algorithm six times at least, once for each pair (1,1), (1,2),
(1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e.
you can't reduce (2,1) to (1,2)), you need 9 different versions of the
algorithm. That sounds more complicated than always decoding.

Regards,
Martin

From jcarlson at uci.edu  Sat Sep 16 20:51:33 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 16 Sep 2006 11:51:33 -0700
Subject: [Python-3000] string C API
In-Reply-To: <450C3BF8.2030901@v.loewis.de>
References: <450C2B20.4090504@gmail.com> <450C3BF8.2030901@v.loewis.de>
Message-ID: <20060916114123.F99E.JCARLSON@uci.edu>


"Martin v. L?wis" <martin at v.loewis.de> wrote:
> 
> Nick Coghlan wrote:
> > If an 8-bit encoding other than latin-1 is used for the internal buffer,
> > then every comparison operation would have to decode the string to
> > Unicode in order to compare code points.
> > 
> > It seems much simpler to me to ensure that what is stored internally is
> > *always* the Unicode code points, with the width (1, 2 or 4 bytes)
> > determined by the largest code point in the string.
> 
> Just try implementing comparison some time. You can end up implementing
> the same algorithm six times at least, once for each pair (1,1), (1,2),
> (1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e.
> you can't reduce (2,1) to (1,2)), you need 9 different versions of the
> algorithm. That sounds more complicated than always decoding.

One algorithm.  Each character can be "decoded" during runtime.

long expand(void* buffer, Py_ssize_t posn, int shift) {
    /* shift is log2(bytes per code point):
       0 -> 1-byte, 1 -> 2-byte, 2 -> 4-byte units */
    char* p = ((char*)buffer) + (posn << shift);
    switch (shift) {
    case 0:  return ((unsigned char*)p)[0];
    case 1:  return ((unsigned short*)p)[0];
    case 2:  return ((unsigned int*)p)[0];  /* assuming 32-bit int */
    default: return -1;
    }
}

Alternatively, with a little work, the 9 variants can be defined with a
prototype system, using macros or otherwise.


 - Josiah


From qrczak at knm.org.pl  Sat Sep 16 23:20:44 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 16 Sep 2006 23:20:44 +0200
Subject: [Python-3000] string C API
In-Reply-To: <450C3BF8.2030901@v.loewis.de> (Martin v.
	Löwis's message of "Sat, 16 Sep 2006 20:01:28 +0200")
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
	<20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol>
	<450B739C.20607@gmail.com> <450C00E9.6070008@v.loewis.de>
	<450C2B20.4090504@gmail.com> <450C3BF8.2030901@v.loewis.de>
Message-ID: <87psdvik43.fsf@qrnik.zagroda>

"Martin v. L?wis" <martin at v.loewis.de> writes:

> Just try implementing comparison some time. You can end up implementing
> the same algorithm six times at least, once for each pair (1,1), (1,2),
> (1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e.
> you can't reduce (2,1) to (1,2)), you need 9 different versions of the
> algorithm. That sounds more complicated than always decoding.

That's why I'm proposing only two variants, ISO-8859-1 and UCS-4.

String equality: two variants. Two others are trivial if the
representation is always canonical.

String < and <=: 8 variants in total, all generated from a single
20-line piece of C code, parametrized by preprocessor macros.

String !=, >, >=: defined in terms of the above.

String concatenation:
   if both strings are narrow:
      allocate a narrow result
      copy narrow from str1 to result
      copy narrow from str2 to result
   else:
      allocate a wide result
      if str1 is narrow:
         copy narrow->wide from str1 to result
      else:
         copy wide from str1 to result
      if str2 is narrow:
         copy narrow->wide from str2 to result
      else:
         copy wide from str2 to result

__contains__, startswith, index: three variants, one other is trivial.

Seems simple enough for me.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From greg.ewing at canterbury.ac.nz  Sun Sep 17 01:17:35 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sun, 17 Sep 2006 11:17:35 +1200
Subject: [Python-3000] string C API
In-Reply-To: <450C3BF8.2030901@v.loewis.de>
References: <fb6fbf560609150725y79c7b4eaw5300fb256502e482@mail.gmail.com>
	<bb8868b90609150922g70a4baa1j858f03a3581fad62@mail.gmail.com>
	<20060915102555.F983.JCARLSON@uci.edu> <1158343473.4292.14.camel@fsol>
	<450B739C.20607@gmail.com> <450C00E9.6070008@v.loewis.de>
	<450C2B20.4090504@gmail.com> <450C3BF8.2030901@v.loewis.de>
Message-ID: <450C860F.9010605@canterbury.ac.nz>

Martin v. Löwis wrote:
> Just try implementing comparison some time. You can end up implementing
> the same algorithm six times at least, once for each pair (1,1), (1,2),
> (1,4), (2,2), (2,4), (4,4).

#define UnicodeStringComparisonFunction(TYPE1, TYPE2) \
   /* code to implement it here */

UnicodeStringComparisonFunction(UCS1, UCS1)
UnicodeStringComparisonFunction(UCS1, UCS2)
UnicodeStringComparisonFunction(UCS1, UCS4)
UnicodeStringComparisonFunction(UCS2, UCS2)
UnicodeStringComparisonFunction(UCS2, UCS4)
UnicodeStringComparisonFunction(UCS4, UCS4)

--
Greg

From meyer at acm.org  Sun Sep 17 14:28:08 2006
From: meyer at acm.org (Andre Meyer)
Date: Sun, 17 Sep 2006 14:28:08 +0200
Subject: [Python-3000] Kill GIL?
Message-ID: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>

Dear Python experts

As a heavy user of multi-threading in Python and following the current
discussions about Python on multi-processor systems on the python-list I
wonder what the plans are for improving MP performance in Py3k. MP systems
become more and more common as most modern processors have multiple
processing units that could be used in parallel by distributing threads.
Unfortunately, the GIL in CPython prevents to use this mechanism. As far as
I understand IronPython, Jython and PyPy do not suffer from this.

While I understand the difficulties in removing the GIL and the potential
negative effect on single-threaded applications I would very much encourage
discussion to seriously consider removing the GIL (maybe optionally) in
Py3k. If not, what alternatives would you suggest?

thanks a lot for your thoughts
Andre

From ncoghlan at gmail.com  Sun Sep 17 15:16:30 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sun, 17 Sep 2006 23:16:30 +1000
Subject: [Python-3000] Kill GIL?
In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
Message-ID: <450D4AAE.2000805@gmail.com>

Andre Meyer wrote:
> While I understand the difficulties in removing the GIL and the 
> potential negative effect on single-threaded applications I would very 
> much encourage discussion to seriously consider removing the GIL (maybe 
> optionally) in Py3k. If not, what alternatives would you suggest?

Brett Cannon's sandboxing work (which aims to provide first-class support for 
multiple interpreters in the same process for security reasons) also seems 
like a potentially fruitful approach to distributing processing to multiple cores:
   - use threads to perform blocking I/O in parallel
   - use multiple interpreters to perform Python execution in parallel

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From qrczak at knm.org.pl  Sun Sep 17 15:43:50 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sun, 17 Sep 2006 15:43:50 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	(Andre Meyer's message of "Sun, 17 Sep 2006 14:28:08 +0200")
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
Message-ID: <87irjmk3qh.fsf@qrnik.zagroda>

"Andre Meyer" <meyer at acm.org> writes:

> While I understand the difficulties in removing the GIL and the
> potential negative effect on single-threaded applications I would
> very much encourage discussion to seriously consider removing the
> GIL (maybe optionally) in Py3k.

I suppose this would require either fundamentally changing the garbage
collection algorithm (lots of work and breaking all C extensions),
or accompanying all reference count adjustments with memory barriers
(a significant performance hit even if a particular object is not
shared between threads; many objects like None will be shared anyway).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From solipsis at pitrou.net  Sun Sep 17 16:51:12 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Sun, 17 Sep 2006 16:51:12 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450D4AAE.2000805@gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450D4AAE.2000805@gmail.com>
Message-ID: <1158504672.28528.82.camel@fsol>

On Sunday, 17 September 2006 at 23:16 +1000, Nick Coghlan wrote:
> Brett Cannon's sandboxing work (which aims to provide first-class support for 
> multiple interpreters in the same process for security reasons) also seems 
> like a potentially fruitful approach to distributing processing to multiple cores:
>    - use threads to perform blocking I/O in parallel

OTOH, the Twisted approach avoids all the delicate synchronization
issues that arise when using threads to perform concurrent IO tasks.

Also, IO is by definition not CPU-intensive, so there is no point in
distributing IO to multiple cores (and it could even cause a small
decrease in performance because of inter-CPU communication overhead).

Regards

Antoine.



From jcarlson at uci.edu  Sun Sep 17 19:56:15 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 17 Sep 2006 10:56:15 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
Message-ID: <20060917103103.F9A4.JCARLSON@uci.edu>


"Andre Meyer" <meyer at acm.org> wrote:
> Dear Python experts
> 
> As a heavy user of multi-threading in Python and following the current
> discussions about Python on multi-processor systems on the python-list I
> wonder what the plans are for improving MP performance in Py3k. MP systems
> become more and more common as most modern processors have multiple
> processing units that could be used in parallel by distributing threads.
> Unfortunately, the GIL in CPython prevents to use this mechanism. As far as
> I understand IronPython, Jython and PyPy do not suffer from this.
> 
> While I understand the difficulties in removing the GIL and the potential
> negative effect on single-threaded applications I would very much encourage
> discussion to seriously consider removing the GIL (maybe optionally) in
> Py3k. If not, what alternatives would you suggest?

Search for 'Python free threading' without quotes in Google to find the
discussions about this topic over the years.

Personally, I think that the effort to remove the GIL in Py3k (or
otherwise) is quite a bit of trouble that we don't want to have to go
through, both from an internal-redesign and a C-extension perspective.

It would be substantially easier if there were a distributed RPC
mechanism that auto distributed to the "least-working" process in a set
of potential working processes on a single machine.  Something with the
simplicity of XML-RPC calling (but without servers and clients) and the
distribution properties of Linda. Of course then we run into a situation
where we need to "pickle" the callable arguments across a connection of
some kind. There is a solution to this on a single machine; copying the
internal representation of every object in the arguments of a function
call to memory shared between all processes (mmap).  With such a
semantic, only mutable portions need to be copied out into non-mmap
memory.
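
A toy sketch of that shared-memory piece (my illustration; POSIX-only,
using an anonymous shared mmap across a fork):

    import mmap, os

    shared = mmap.mmap(-1, 4096)   # anonymous mapping, MAP_SHARED
    if os.fork() == 0:
        shared[0:5] = "hello"      # child writes into shared memory
        os._exit(0)
    os.wait()
    print shared[0:5]              # parent sees the child's write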

With that RPC mechanism and file handle migration (available on BSDs
natively, linux with minor work, and Windows via pywin32), most
operations would *just work*, would be reasonably fast, and Python could
keep its GIL - which would be substantially less work for everyone
involved.


The details are cumbersome (specifically the copying of Python objects
to/from memory), but they can be made less cumbersome if one only allows
builtin objects to be transferred.

 - Josiah


From brett at python.org  Sun Sep 17 20:03:34 2006
From: brett at python.org (Brett Cannon)
Date: Sun, 17 Sep 2006 11:03:34 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450D4AAE.2000805@gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450D4AAE.2000805@gmail.com>
Message-ID: <bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>

On 9/17/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
>
> Andre Meyer wrote:
> > While I understand the difficulties in removing the GIL and the
> > potential negative effect on single-threaded applications I would very
> > much encourage discussion to seriously consider removing the GIL (maybe
> > optionally) in Py3k. If not, what alternatives would you suggest?
>
> Brett Cannon's sandboxing work (which aims to provide first-class support
> for
> multiple interpreters in the same process for security reasons) also seems
> like a potentially fruitful approach to distributing processing to
> multiple cores:
>    - use threads to perform blocking I/O in parallel
>    - use multiple interpreters to perform Python execution in parallel


Possibly, but as it stands now interpreters just execute in their own Python
thread, so there is no real performance boost.  Without the GIL shifting
over to per interpreter instead of per process there are going to be the same
performance problems as with Python threads.  And changing that would be
hard since objects can be shared between multiple interpreters.

-Brett

From ronaldoussoren at mac.com  Sun Sep 17 20:36:40 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Sun, 17 Sep 2006 20:36:40 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450D4AAE.2000805@gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450D4AAE.2000805@gmail.com>
Message-ID: <A8560AE1-801E-4F04-B7EF-61F489090354@mac.com>


On Sep 17, 2006, at 3:16 PM, Nick Coghlan wrote:

> Andre Meyer wrote:
>> While I understand the difficulties in removing the GIL and the
>> potential negative effect on single-threaded applications I would  
>> very
>> much encourage discussion to seriously consider removing the GIL  
>> (maybe
>> optionally) in Py3k. If not, what alternatives would you suggest?
>
> Brett Cannon's sandboxing work (which aims to provide first-class  
> support for
> multiple interpreters in the same process for security reasons)  
> also seems
> like a potentially fruitful approach to distributing processing to  
> multiple cores:
>    - use threads to perform blocking I/O in parallel
>    - use multiple interpreters to perform Python execution in parallel

... except when you use extensions that use the PyGILState APIs,  
those don't work with multiple interpreters :-(.

Ronald


From rasky at develer.com  Sun Sep 17 23:58:57 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sun, 17 Sep 2006 23:58:57 +0200
Subject: [Python-3000] Kill GIL?
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<20060917103103.F9A4.JCARLSON@uci.edu>
Message-ID: <007501c6daa4$79768520$a14c2597@bagio>

Josiah Carlson <jcarlson at uci.edu> wrote:

> It would be substantially easier if there were a distributed RPC
> mechanism that auto distributed to the "least-working" process in a
> set
> of potential working processes on a single machine.  [...]

I'm not sure I follow you. Would you mind providing an example of a plausible
API for this mechanism (aka what the code would look like, compared to the
current Python threading classes)?

Giovanni Bajo


From jcarlson at uci.edu  Mon Sep 18 03:18:32 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 17 Sep 2006 18:18:32 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <007501c6daa4$79768520$a14c2597@bagio>
References: <20060917103103.F9A4.JCARLSON@uci.edu>
	<007501c6daa4$79768520$a14c2597@bagio>
Message-ID: <20060917180402.07DB.JCARLSON@uci.edu>


"Giovanni Bajo" <rasky at develer.com> wrote:
> Josiah Carlson <jcarlson at uci.edu> wrote:
> 
> > It would be substantially easier if there were a distributed RPC
> > mechanism that auto distributed to the "least-working" process in a
> > set
> > of potential working processes on a single machine.  [...]
> 
> I'm not sure I follow you. Would you mind providing an example of a plausible
> API for this mechanism (aka what the code would look like, compared to the
> current Python threading classes)?

    import autorpc
    caller = autorpc.init_processes(autorpc.num_processors())

    import callables
    caller.register_module(callables)

    result = caller.fcn1(arg1, arg2, arg3)

The point is not to compare the API, etc., with threading, but to compare
it with XMLRPC.  Because ultimately, what I would like to see is a
mechanism similar to XMLRPC: call a method on an instance that is
automatically executed, perhaps in some other thread in some other
process, or maybe even in the same thread on the same process (depending
on load, etc.), and which returns the result in-place.

It's just much easier to handle (IMO).  The above shows a single
call/return.  What if you don't care about getting a result back before
continuing, or perhaps you have a bunch of things you want to get done?

    ...
    q = Queue.Queue()

    caller.delayed(q.put).fcn1(arg1, arg2, arg3)
    r = q.get() #will be delayed until q gets something

What to do about exceptions happening in fcn1 remotely?  A fellow over
in the wxPython mailing list brought up the idea of exception objects;
perhaps not stackframes, etc., but perhaps an object with information
like exception type and traceback, used for both delayed and non-delayed
tracebacks.
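
A tiny sketch of that exception-object idea (purely hypothetical names):

    import traceback

    class RemoteError(object):
        # Describes an exception that happened in another process:
        # the exception class plus the formatted remote traceback.
        def __init__(self, exc_class, tb_text):
            self.exc_class = exc_class
            self.tb_text = tb_text

    def call_guarded(fcn, *args):
        # Run fcn in the worker; return either its result or a
        # RemoteError describing what went wrong.
        try:
            return fcn(*args)
        except Exception, e:
            return RemoteError(e.__class__, traceback.format_exc())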


 - Josiah


From ross at sourcelabs.com  Mon Sep 18 06:06:34 2006
From: ross at sourcelabs.com (Ross Jekel)
Date: Sun, 17 Sep 2006 21:06:34 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <mailman.530.1158530744.10490.python-3000@python.org>
References: <mailman.530.1158530744.10490.python-3000@python.org>
Message-ID: <courier.00000000450E1B4A.00002758@mail-1.colo.sourcelabs.com>

I know it is a bit old, but would Python Object Sharing (POSH) 
http://poshmodule.sourceforge.net help you? Also, if you think you like the 
current state-of-the-art threading model, you might not after reading this: 

http://tinyurl.com/qvcbr 

This goes to an article on http://www.computer.org with a long URL entitled 
"The Problem with Threads." 

After some initial surprise when I learned about it, I'm now okay with a GIL 
or even single threaded python (with async I/O if necessary). In my opinion 
threaded programs with one big shared data space (like CPython's) are 
fundamentally untestable and unverifiable, and the GIL was the best solution 
to reduce risk in that area. I am happy the GIL exists because it forces me 
to come up with designs for programs and systems that are easier to write, more
predictable both in terms of correctness and performance, and easier to 
maintain and scale. I think there would be significant backlash in the 
Python development community the first time an intermittent race condition 
or a deadlock occurs in the CPython interpreter after years of relying on it
as a predictable, reliable platform. 

I'm also happy the GIL exists because it forces alternative ideas like 
Twisted and stackless to be developed and tried. 

If you have shared data that really benefits from synchronized access and 
updates, write an extension, release the GIL at the appropriate places, and 
do whatever you want in a C data structure. I've done this when necessary 
and think it is the best of both worlds. I guess I'm assuming this will 
still be possible in Python 3000 (I haven't been on the list that long, 
sorry). 

There has to be a better concurrency model than threads. Let's design for 
the future with useful packages that implement the best ideas of today for 
scaling well without threads. 

Ross

From ironfroggy at gmail.com  Mon Sep 18 06:50:34 2006
From: ironfroggy at gmail.com (Calvin Spealman)
Date: Mon, 18 Sep 2006 00:50:34 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <20060917180402.07DB.JCARLSON@uci.edu>
References: <20060917103103.F9A4.JCARLSON@uci.edu>
	<007501c6daa4$79768520$a14c2597@bagio>
	<20060917180402.07DB.JCARLSON@uci.edu>
Message-ID: <76fd5acf0609172150o55e79fddta141e348bffb342@mail.gmail.com>

On 9/17/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Giovanni Bajo" <rasky at develer.com> wrote:
> > Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > > It would be substantially easier if there were a distributed RPC
> > > mechanism that auto distributed to the "least-working" process in a
> > > set
> > > of potential working processes on a single machine.  [...]
> >
> > I'm not sure I follow you. Would you mind providing an example of a plausible
> > API for this mechanism (aka what the code would look like, compared to the
> > current Python threading classes)?
>
>     import autorpc
>     caller = autorpc.init_processes(autorpc.num_processors())
>
>     import callables
>     caller.register_module(callables)
>
>     result = caller.fcn1(arg1, arg2, arg3)
>
> The point is not to compare API/etc., with threading, but to compare it
> with XMLRPC.  Because ultimately, what I would like to see, is a
> mechanic similar to XMLRPC; call a method on an instance, that is
> automatically executed perhaps in some other thread in some other
> process, or maybe even in the same thread on the same process (depending
> on load, etc.), and which returns the result in-place.
>
> It's just much easier to handle (IMO).  The above example highlights an
> example of single call/return.  What if you don't care about getting a
> result back before continuing, or perhaps you have a bunch of things you
> want to get done?
>
>     ...
>     q = Queue.Queue()
>
>     caller.delayed(q.put).fcn1(arg1, arg2, arg3)
>     r = q.get() #will be delayed until q gets something
>
> What to do about exceptions happening in fcn1 remotely?  A fellow over
> in the wxPython mailing list brought up the idea of exception objects;
> perhaps not stackframes, etc., but perhaps an object with information
> like exception type and traceback, used for both delayed and non-delayed
> tracebacks.
>
>
>  - Josiah

I would be thrilled to see this kind of api brought into python. It
could very likely be implemented in time for Python 2.6, which would
be spawning processes to handle the load. At the very least, a Python
2.4 or older compatible module could be released to test the waters and
see what works and doesn't in this idea. I tried to wrap my head around
different options on how the GIL might go away, but in the end you
just realize you would hate to see it go away.

I'm sure Twisted would have a field day with such a facility.

If this kind of thing gets brought into Python, it would almost
require some form of MapReduce to come along with it. Of course, with the
existing talks about removing map as a built-in in favor of list
comprehensions, it makes one consider if listcomps and genexps might
have some way to utilize a distributed model natively. Some
consideration toward that end would be valuable.

From jcarlson at uci.edu  Mon Sep 18 07:14:26 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 17 Sep 2006 22:14:26 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <courier.00000000450E1B4A.00002758@mail-1.colo.sourcelabs.com>
References: <mailman.530.1158530744.10490.python-3000@python.org>
	<courier.00000000450E1B4A.00002758@mail-1.colo.sourcelabs.com>
Message-ID: <20060917212744.07DE.JCARLSON@uci.edu>


"Ross Jekel" <ross at sourcelabs.com> wrote:
> I know it is a bit old, but would Python Object Sharing (POSH) 
> http://poshmodule.sourceforge.net help you? Also, if you think you like the 
> current state-of-the-art threading model, you might not after reading this: 

The RPC-like mechanism I described earlier could be implemented on top
of POSH if desired, though I believe that some of the potential issues
that POSH has yet to fix (see its TODO file) aren't as much of a concern
(if at all) when only using shared memory as IPC and not as an object
store.

> "The Problem with Threads."

Getting *concurrency* right is hard.  One way of making it not so hard
is to use simple abstractions, like producer/consumer, deferred results,
etc.  But not everything fits into these abstractions, and sometimes
there are data structure manipulations that require locking. In that
sense, it's not so much that *concurrency* is hard to get right, as much
as locking is hard to get right.

But with Python 2.5, we get the 'with' statement and context managers. 
Add context managers to locks, always use RLocks (so that you can
.acquire() a lock multiple times), and while it hasn't gotten easy (one
needs to be careful with lock acquisition order to prevent deadlocks,
especially when mixing locks with queues), more concurrency tasks have
gotten *easier*.


Essentially the article points out that using abstractions like
producer/consumer, deferreds, etc., can make concurrent programming not
so hard, and you have to be basically insane to use threads in your
concurrent programming (I've been doing it for about 7 years, and am
thoroughly insane), but unless I'm missing something (I only skimmed the
article when it first came out, so this is quite possible), it's not
really saying anything new to the concurrent programmer (of nontrivial
systems).


With the API and RPC mechanism I sketched out earlier, threads are a
possible underlying implementation detail.  Essentially, it tries to
force everything into a producer/consumer abstraction; the function I
call is a consumer of the arguments I pass, and it produces a result
that I (or someone else) later consume.  This somewhat limits what kinds
of things can be done 'natively', but you can't get everything.


 - Josiah


From krstic at solarsail.hcs.harvard.edu  Mon Sep 18 07:55:59 2006
From: krstic at solarsail.hcs.harvard.edu (Ivan Krstić)
Date: Mon, 18 Sep 2006 01:55:59 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
Message-ID: <450E34EF.3090202@solarsail.hcs.harvard.edu>

Andre Meyer wrote:
> As a heavy user of multi-threading in Python and following the current
> discussions about Python on multi-processor systems on the python-list I
> wonder what the plans are for improving MP performance in Py3k. 

I have four aborted e-mails in my 'Drafts' folder that are asking the
same question; each time, I decided that the almost inevitably ensuing
"threads suck!" flamewar just isn't worth it. Now that someone else has
taken the plunge...

At present, the Python approach to multi-processing sounds a bit like
"let's stick our collective hands in the sand and pretend there's no
problem". In particular, one oft-parroted argument says that it's not
worth changing or optimizing the language for the few people who can
afford SMP hardware. In the meantime, dual-core laptops are becoming the
standard, with Intel predicting quad-core will become mainstream in the
next few years, and the number of server orders for single-core, UP
machines is plummeting.

From this, it's obvious to me that we need to do *something* to
introduce stronger multi-processing support. Our current abilities are
rather bad: we offer no microthreads, which is making elegant
concurrency primitives such as Erlang's, ported to Python by the
Candygram project [0], unnecessarily expensive. Instead, we only offer
heavy threads that each allocate a full-size stack, and there's no
actual ability to parallelize thread execution across CPUs. There's also
no way to simply fork and coordinate between the forked processes,
depending on the nature of the problem being solved, since there's no
shared memory primitive in the stdlib (this because shared memory
semantics are notoriously different across platforms). On top of it all,
any adopted solution needs to be implementable across all the major
Python interpreters, which makes finding a solution that much harder.

The way I see it, we have several options:

* Bite the bullet; write and support a stdlib SHM primitive that works
wherever possible, and simply doesn't work on completely broken
platforms (I understand Windows falls into this category). Utilize it in
a lightweight fork-and-coordinate wrapper provided in the stdlib.

* Bite the mortar shell, and remove the GIL.

* Introduce microthreads, declare that Python endorses Erlang's
no-sharing approach to concurrency, and incorporate something like
candygram into the stdlib.

* Introduce a fork-and-coordinate wrapper in the stdlib, and declare
that we're simply not going to support the use case that requires
sharing (as opposed to merely passing) objects between processes.

The first option is a Pareto optimization, but having stdlib
functionality flat out unavailable on some platforms might be out of the
question. It'd be good to hear Guido's longer-term view on concurrency
in Python. That discussion might be more appropriate on python-dev, though.

Cheers,


[0] http://candygram.sourceforge.net/

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From bob at redivi.com  Mon Sep 18 08:29:44 2006
From: bob at redivi.com (Bob Ippolito)
Date: Sun, 17 Sep 2006 23:29:44 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E34EF.3090202@solarsail.hcs.harvard.edu>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
Message-ID: <6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com>

On 9/17/06, Ivan Krstić <krstic at solarsail.hcs.harvard.edu> wrote:
> Andre Meyer wrote:
> > As a heavy user of multi-threading in Python and following the current
> > discussions about Python on multi-processor systems on the python-list I
> > wonder what the plans are for improving MP performance in Py3k.
>
> I have four aborted e-mails in my 'Drafts' folder that are asking the
> same question; each time, I decided that the almost inevitably ensuing
> "threads suck!" flamewar just isn't worth it. Now that someone else has
> taken the plunge...
>
> At present, the Python approach to multi-processing sounds a bit like
> "let's stick our collective hands in the sand and pretend there's no
> problem". In particular, one oft-parroted argument says that it's not
> worth changing or optimizing the language for the few people who can
> afford SMP hardware. In the meantime, dual-core laptops are becoming the
> standard, with Intel predicting quad-core will become mainstream in the
> next few years, and the number of server orders for single-core, UP
> machines is plummeting.
>
> From this, it's obvious to me that we need to do *something* to
> introduce stronger multi-processing support. Our current abilities are
> rather bad: we offer no microthreads, which is making elegant
> concurrency primitives such as Erlang's, ported to Python by the
> Candygram project [0], unnecessarily expensive. Instead, we only offer
> heavy threads that each allocate a full-size stack, and there's no
> actual ability to parallelize thread execution across CPUs. There's also
> no way to simply fork and coordinate between the forked processes,
> depending on the nature of the problem being solved, since there's no
> shared memory primitive in the stdlib (this because shared memory
> semantics are notoriously different across platforms). On top of it all,
> any adopted solution needs to be implementable across all the major
> Python interpreters, which makes finding a solution that much harder.

Candygram is heavyweight by trade-off, not because it has to be.
Candygram could absolutely be implemented efficiently in current
Python if a Twisted-like style were used. An API that exploits Python
2.5's with-statement blocks and enhanced iterators would make it less
verbose than a traditional Twisted app and potentially easier to learn.
Stackless or greenlets could be used for an even lighter weight API,
though not as portably.
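
For instance, the enhanced-iterator flavor of that idea might look
roughly like this (a toy mailbox-and-scheduler sketch, nowhere near a
complete Candygram; each "process" costs a generator, not a stack):

import collections

class Process(object):
    def __init__(self, gen):
        self.gen = gen
        self.gen.next()          # prime the generator to its first yield
        self.mailbox = collections.deque()

def schedule(procs):
    runnable = collections.deque(procs)
    while runnable:
        proc = runnable.popleft()
        if not proc.mailbox:
            continue             # toy policy: starved processes drop out
        try:
            proc.gen.send(proc.mailbox.popleft())   # deliver a message
            runnable.append(proc)
        except StopIteration:
            pass

def echo():
    while True:
        msg = yield              # 'receive'
        print 'echo got:', msg

p = Process(echo())
p.mailbox.append('hello')
schedule([p])                    # prints 'echo got: hello'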

> The way I see it, we have several options:
>
> * Bite the bullet; write and support a stdlib SHM primitive that works
> wherever possible, and simply doesn't work on completely broken
> platforms (I understand Windows falls into this category). Utilize it in
> a lightweight fork-and-coordinate wrapper provided in the stdlib.

I really don't think that's the right approach. If we're going to
bother supporting distributed processing, we might as well support it
in a portable way that can scale across machines.

> * Bite the mortar shell, and remove the GIL.

This really isn't even an option because we're not throwing away the
current C Python implementation. The C API would have to change quite
a bit for that.

> * Introduce microthreads, declare that Python endorses Erlang's
> no-sharing approach to concurrency, and incorporate something like
> candygram into the stdlib.

We have cooperatively scheduled microthreads with ugly syntax (yield),
or more platform-specific and much less debuggable microthreads with
stackless or greenlets.

The missing part is the async message passing API and the libraries to
go with it. Erlang uses something a lot like pickle for this, but
Erlang only has about 8 types that are all immutable (IIRC: function,
binary, list, tuple, pid, atom, integer, float). Communication between
Erlang nodes requires a cookie (shared secret), which skirts around
security issues. You can definitely kill an Erlang node if you have
its cookie by flooding the atom table (atoms are like interned
strings), but that's not considered to be a problem in most deployment
scenarios.

> * Introduce a fork-and-coordinate wrapper in the stdlib, and declare
> that we're simply not going to support the use case that requires
> sharing (as opposed to merely passing) objects between processes.

What use case *requires* sharing? In a message passing system, usage
of shared memory is an optimization that you shouldn't care much about
as a user. Also, sockets are generally very fast over loopback.

IIRC, Erlang only does this with binaries > 64 bytes long across
processes on the same node (same pid, but not necessarily the same
pthread in an SMP build). HiPE might do some more aggressive
communication optimizations... but I think the general idea is that
sending a really big message to another process is probably the wrong
thing to do anyway.

-bob

From martin at v.loewis.de  Mon Sep 18 08:44:41 2006
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Mon, 18 Sep 2006 08:44:41 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
Message-ID: <450E4059.1000806@v.loewis.de>

Andre Meyer schrieb:
> While I understand the difficulties in removing the GIL and the
> potential negative effect on single-threaded applications I would very
> much encourage discussion to seriously consider removing the GIL (maybe
> optionally) in Py3k. If not, what alternatives would you suggest?

Encouraging "very much" is probably not good enough to make anything
happen. Actual code contributions may, as may offering a bounty
(although it probably depends on the size of the bounty whether anybody
 wants to collect it).

The alternatives are very straight-forward:
1. use Python the same way as you did for Python 2.x. I.e. create
   many threads, and have only one of them run. Use the other processors
   for something else, or don't use them at all.
2. use Python the same way as many other people do. Don't use threads,
   instead use multiple processors, and some sort of IPC.
3. don't use Python, at least not for the activities that need to
   run on multiple processors.
If you want to fully use your multiple processors, depending on the
application, I'd typically go with option 2 or 3. Option 2 if the code
to parallelize is written in Python, option 3 if it is written in C
(yes, you can use multiple truly concurrent threads in Python: just
 release the GIL on the C level; you can't make any calls into Python
 until you reacquire the GIL).
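
For what it's worth, that last point is visible from pure Python: C
functions that release the GIL really do overlap. A small sketch
(assuming CPython's zlib, which drops the GIL while compressing):

import threading, time, zlib

data = 'x' * 10000000

def work():
    # zlib.compress runs in C with the GIL released, so two of these
    # can genuinely execute in parallel on two CPUs.
    zlib.compress(data)

start = time.time()
threads = [threading.Thread(target=work) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print 'elapsed:', time.time() - start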

Regards,
Martin

From jcarlson at uci.edu  Mon Sep 18 09:25:57 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Mon, 18 Sep 2006 00:25:57 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E34EF.3090202@solarsail.hcs.harvard.edu>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
Message-ID: <20060918001556.07E4.JCARLSON@uci.edu>


Ivan Krstic <krstic at solarsail.hcs.harvard.edu> wrote:
> * Bite the bullet; write and support a stdlib SHM primitive that works
> wherever possible, and simply doesn't work on completely broken
> platforms (I understand Windows falls into this category). Utilize it in
> a lightweight fork-and-coordinate wrapper provided in the stdlib.

Shared memory as an object store, or as IPC?  Either way, shared mmaps
offer shared memory on most platforms.  Which ones?  Windows, Linux,
OS X, Solaris, the BSDs, ... I would be surprised if IRIX, AIX, HP-UX and
other "big iron" OSes /didn't/ support shared mmaps.  Sure, you don't
get it on little embedded machines, but I'm not sure if we want to worry
about concurrency libraries there.
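
On the POSIX side the basic trick is only a few lines (a bare-bones
sketch; real use needs locking and an actual protocol, and Windows
would want a named map instead of fork):

import mmap, os, struct, time

# An anonymous shared map created before fork() is visible to both
# parent and child on POSIX systems; its pages start out zero-filled.
shm = mmap.mmap(-1, 4096)

pid = os.fork()
if pid == 0:
    # Child: poll offset 0 until the parent stores a nonzero value.
    while struct.unpack('i', shm[:4])[0] == 0:
        time.sleep(0.01)
    print 'child sees', struct.unpack('i', shm[:4])[0]
    os._exit(0)
else:
    shm[:4] = struct.pack('i', 42)   # parent pokes a value into the map
    os.waitpid(pid, 0)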


Alternatively, for platforms that support them, I have found that
synchronous Unix domain sockets can push about 3x as much as the
loopback interface: about 1 GByte/second on a 3 GHz Xeon, vs. around
350 MBytes/second for loopback TCP/IP.  I haven't tried using domain or
TCP/IP sockets purely for synchronization ("check the mmap at offset X,
length Y" messages), but I would imagine that would be competitive.


 - Josiah


From krstic at solarsail.hcs.harvard.edu  Mon Sep 18 09:38:38 2006
From: krstic at solarsail.hcs.harvard.edu (Ivan Krstić)
Date: Mon, 18 Sep 2006 03:38:38 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
	<6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com>
Message-ID: <450E4CFE.8080209@solarsail.hcs.harvard.edu>

Bob Ippolito wrote:
> Candygram is heavyweight by trade-off, not because it has to be.
> Candygram could absolutely be implemented efficiently in current
> Python if a Twisted-like style was used. 

Specifically?

>> * Bite the bullet; write and support a stdlib SHM primitive that works [..]
>> a lightweight fork-and-coordinate wrapper provided in the stdlib.
> 
> I really don't think that's the right approach. If we're going to
> bother supporting distributed processing, we might as well support it
> in a portable way that can scale across machines.

Fork-and-coordinate is a specialized case of distribute-and-coordinate.
Other d-a-c mechanisms can be provided, including those that utilize
some form of RPC as a transport. SHM is orthogonal to all of this.

Note that scaling across machines is only equivalent to scaling across
CPUs in the simple case; in more complicated cases, there's a lot of
glue involved that grid frameworks like Boinc provide. If we end up
shipping any cross-machine abilities in the stdlib, we'd have to make
sure it's clear that we're not attempting to provide a grid framework,
just the plumbing that someone could use to build one.

>> * Bite the mortar shell, and remove the GIL.
> 
> This really isn't even an option because we're not throwing away the
> current C Python implementation. The C API would have to change quite
> a bit for that.

Hence 'mortar shell'. It can be done, but I think Guido's been pretty
clear on it not happening anytime soon.

> We have cooperatively scheduled microthreads with ugly syntax (yield),
> or more platform-specific and much less debuggable microthreads with
> stackless or greenlets.

Right. This is why I'm not sure we want to be recommending either as
`the` Python way to do concurrency.

> What use case *requires* sharing? 

Strictly speaking, it's always avoidable. But in setup-heavy systems,
avoiding SHM is a massive and costly pain. Consider web applications;
ideally, you can preload one copy of all of your translations, database
information, and other static information, into RAM -- and have worker
threads do reads from this table as they're processing individual
requests. Without SHM, you'd have to either duplicate the static set in
memory for each CPU, or make individual requests for each desired piece
of information to the master process that keeps the static set in RAM.

I've seen a number of computationally-bound systems that require an
authoritative copy of a (large) dataset in RAM, and are OK with paying
the cost of a read waiting on a lock during a write (and since writes
only happen at the completion of complex calculations, they generally
want to use locking like that provided by brlocks in the Linux kernel).
All of this is workable without SHM, but some of it gets really unwieldy.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From ironfroggy at gmail.com  Mon Sep 18 09:45:05 2006
From: ironfroggy at gmail.com (Calvin Spealman)
Date: Mon, 18 Sep 2006 03:45:05 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E4CFE.8080209@solarsail.hcs.harvard.edu>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
	<6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com>
	<450E4CFE.8080209@solarsail.hcs.harvard.edu>
Message-ID: <76fd5acf0609180045r41fa6ef1tc78b70228f3c5fe@mail.gmail.com>

On 9/18/06, Ivan Krstić <krstic at solarsail.hcs.harvard.edu> wrote:
> > What use case *requires* sharing?
>
> Strictly speaking, it's always avoidable. But in setup-heavy systems,
> avoiding SHM is a massive and costly pain. Consider web applications;
> ideally, you can preload one copy of all of your translations, database
> information, and other static information, into RAM -- and have worker
> threads do reads from this table as they're processing individual
> requests. Without SHM, you'd have to either duplicate the static set in
> memory for each CPU, or make individual requests for each desired piece
> of information to the master process that keeps the static set in RAM.
>
> I've seen a number of computationally-bound systems that require an
> authoritative copy of a (large) dataset in RAM, and are OK with paying
> the cost of a read waiting on a lock during a write (and since writes
> only happen at the completion of complex calculations, they generally
> want to use locking like that provided by brlocks in the Linux kernel).
> All of this is workable without SHM, but some of it gets really unwieldy.

So reload the information you want available to worker tasks and pass
that information along to them, or provide a mechanism for them to
request it from its preloaded housing. Shared memory isn't the only or
best way to share resources between the tasks involved. Very rarely
would any worker task need more than a few rows of any large preloaded
table.

Alternatively, one could say you don't usually want any preloaded data,
because there is simply too much information to preload, and reusable
worker tasks can provide their own, more effectively targeted caches.
One might even consider some setup whereby worker threads report
their cache contents to a controller, which distributes tasks to
workers it knows already have the required information cached.

But in the end, you have to realize this is all at a higher level than
we would really need to even consider for the discussion at hand.

From dialtone at divmod.com  Mon Sep 18 09:50:26 2006
From: dialtone at divmod.com (Valentino Volonghi aka Dialtone)
Date: Mon, 18 Sep 2006 09:50:26 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <20060917180402.07DB.JCARLSON@uci.edu>
Message-ID: <20060918075026.1717.707665485.divmod.quotient.52675@ohm>

On Sun, 17 Sep 2006 18:18:32 -0700, Josiah Carlson <jcarlson at uci.edu> wrote:
>    import autorpc
>    caller = autorpc.init_processes(autorpc.num_processors())
>
>    import callables
>    caller.register_module(callables)
>
>    result = caller.fcn1(arg1, arg2, arg3)
>
>The point is not to compare API/etc., with threading, but to compare it
>with XMLRPC.  Because ultimately, what I would like to see, is a
>mechanic similar to XMLRPC; call a method on an instance, that is
>automatically executed perhaps in some other thread in some other
>process, or maybe even in the same thread on the same process (depending
>on load, etc.), and which returns the result in-place.

I've written something similar, taking inspiration from axiom.batch (from divmod.org). The result is the following code:

import sys, os

from twisted.internet import reactor
from twisted.protocols import amp
from twisted.internet import protocol
from twisted.python import log
from epsilon import process

# These are the Commands, they are needed to call remote methods safely.
class Sum(amp.Command):
    arguments = [('a', amp.Integer()),
                 ('b', amp.Integer())]
    response = [('total', amp.Integer())]

class StopReactor(amp.Command):
    arguments = [('delay', amp.Integer())]
    response = [('status', amp.String())]

# This is the class that tells the RPC exposed methods and their
# implementation
class JustSum(amp.AMP):
    def sum(self, a, b):
        total = a + b
        log.msg('Did a sum: %d + %d = %d' % (a, b, total))
        return {'total': total}
    Sum.responder(sum)

    def stop(self, delay):
        reactor.callLater(delay, reactor.stop)
        return {'status': 'scheduled'}
    StopReactor.responder(stop)

# Various stuff needed to use AMP over stdin/stdout/stderr with a child 
# process
class AMPConnector(protocol.ProcessProtocol):
    def __init__(self, proto, controller):
        self.amp = proto
        self.controller = controller

    def connectionMade(self):
        log.msg("Subprocess started.")
        self.amp.makeConnection(self)
        self.controller.childProcessCreated()

    # Transport
    disconnecting = False

    def write(self, data):
        self.transport.write(data)

    def writeSequence(self, data):
        self.transport.writeSequence(data)

    def loseConnection(self):
        self.transport.loseConnection()

    def getPeer(self):
        return ('omfg what are you talking about',)

    def getHost(self):
        return ('seriously it is a process this makes no sense',)

    def inConnectionLost(self):
        log.msg("Standard in closed")
        protocol.ProcessProtocol.inConnectionLost(self)

    def outConnectionLost(self):
        log.msg("Standard out closed")
        protocol.ProcessProtocol.outConnectionLost(self)

    def errConnectionLost(self):
        log.msg("Standard err closed")
        protocol.ProcessProtocol.errConnectionLost(self)

    def outReceived(self, data):
        self.amp.dataReceived(data)

    def errReceived(self, data):
        log.msg("Received stderr from subprocess: " + repr(data))

    def processEnded(self, status):
        log.msg("Process ended")
        self.amp.connectionLost(status)
        self.controller.childProcessTerminated(status)

# Here you write the code that uses the commands above.
class ProcessController(object):
    def childProcessCreated(self):
        def _cb(result):
            print result
            d = self.child.callRemote(StopReactor, delay=0)
            d.addErrback(lambda _: reactor.stop())
        def _eb(error):
            print error
        d = self.child.callRemote(Sum, a=4, b=5)
        d.addCallback(_cb)
        d.addErrback(_eb)
        
    def childProcessTerminated(self, status):
        print status

    def startProcess(self):
        executable = '/Library/Frameworks/Python.framework/Versions/2.4/bin/python'
        env = os.environ
        env['PYTHONPATH'] = os.pathsep.join(sys.path)
        self.child = JustSum()
        self.connector = AMPConnector(self.child, self)
        
        args = (
            executable,
            '/usr/bin/twistd',
            '--logfile=/Users/dialtone/Projects/python/twist/sub.log',
            '--pidfile=/Users/dialtone/Projects/python/twist/sub.pid',
            '-noy',
            '/Users/dialtone/Projects/python/twist/sub.tac')
        self.process = process.spawnProcess(self.connector, executable, args, env=env)

if __name__ == '__main__':
    p = ProcessController()
    reactor.callWhenRunning(p.startProcess)
    reactor.run()

If you exclude the 'boilerplate' code, you end up with a ProcessController
class and the three other classes that define the RPC. Since the parent
process uses self.child = JustSum(), it also exposes the same API to the
child, which will be able to call any of the methods.
The style above may not be immediately obvious to people not used to Twisted,
but other than that it's not really hard to abstract a bit more and provide
an API similar to what you described.

HTH

From krstic at solarsail.hcs.harvard.edu  Mon Sep 18 10:06:59 2006
From: krstic at solarsail.hcs.harvard.edu (Ivan Krstić)
Date: Mon, 18 Sep 2006 04:06:59 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <76fd5acf0609180045r41fa6ef1tc78b70228f3c5fe@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	
	<450E34EF.3090202@solarsail.hcs.harvard.edu>	
	<6a36e7290609172329s5bf83d2fq911508484e463d2e@mail.gmail.com>	
	<450E4CFE.8080209@solarsail.hcs.harvard.edu>
	<76fd5acf0609180045r41fa6ef1tc78b70228f3c5fe@mail.gmail.com>
Message-ID: <450E53A3.6050309@solarsail.hcs.harvard.edu>

Calvin Spealman wrote:
> So reload the information you want available to worker tasks and pass
> that information along to them, or provide a mechanism for them to
> request it from its preloaded housing. 

With large sets, you can't afford duplicate copies in memory, so there's
nothing to reload. I specifically mentioned providing a mechanism for
retrieving individual pieces of information from the master process, but
if you're doing lots of reads, this introduces complexity and overhead
that's best avoided. Maybe it doesn't matter; with an appropriately nice
interface for it in the distribute-and-coordinate wrapper, we might be
able to hide the complexity from the programmer and use the best
available IPC mechanism in the background to ferry the requests. Sync
domain sockets are certainly fast enough, even though you're again
unnecessarily duplicating parts of memory for each of your workers.

> Alternatively, one could say you dont usually want any preloaded data
> because there is simply too much information to preload and reusable
> worker tasks can provide their own, more effectively targetted caches.

I'm talking about real-world problems, where this most often doesn't work.

> But in the end, you have to realize this is all at a higher level than
> we would really need to even consider for the discussion at hand.

I was answering a direct question.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From paul at prescod.net  Mon Sep 18 10:19:40 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 18 Sep 2006 01:19:40 -0700
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E34EF.3090202@solarsail.hcs.harvard.edu>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
Message-ID: <1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>

On 9/17/06, Ivan Krstić <krstic at solarsail.hcs.harvard.edu> wrote:

>
> At present, the Python approach to multi-processing sounds a bit like
> "let's stick our collective hands in the sand and pretend there's no
> problem". In particular, one oft-parroted argument says that it's not
> worth changing or optimizing the language for the few people who can
> afford SMP hardware. In the meantime, dual-core laptops are becoming the
> standard, with Intel predicting quad-core will become mainstream in the
> next few years, and the number of server orders for single-core, UP
> machines is plummeting.


I agree with you Ivan.

Even if I won't contribute code or even a design to the solution (because it
isn't an area of expertise and I'm still working on encodings stuff) I think
that there would be value in saying: "There's a big problem here and we
intend to fix it in Python 3000."

When you state baldly that something is a problem you encourage the
community to experiment with and debate solutions. But I have been in the
audience at Python conferences where the majority opinion was that Python
had no problem around multi-processor apps because you could just roll your
own IPC on top of processes.

If you have to roll your own, that's a problem. If you have to select
between five solutions with really subtle tradeoffs, that's a problem too.

Ivan: why don't you write a PEP about this?



> * Bite the bullet; write and support a stdlib SHM primitive that works
> wherever possible, and simply doesn't work on completely broken
> platforms (I understand Windows falls into this category). Utilize it in
> a lightweight fork-and-coordinate wrapper provided in the stdlib.


Such a low-level approach will not fly. Not just because of Windows but also
because of Jython and IronPython. But maybe I misunderstand it in general.
Python does not really have an abstraction as low-level as "memory" and I don't
see why we would want to add it.

> * Introduce microthreads, declare that Python endorses Erlang's
> no-sharing approach to concurrency, and incorporate something like
> candygram into the stdlib.
>
> * Introduce a fork-and-coordinate wrapper in the stdlib, and declare
> that we're simply not going to support the use case that requires
> sharing (as opposed to merely passing) objects between processes.


I'm confused on a few levels.

 1. "No sharing" seems to be a feature of both of these options, but the
wording you use to describe it is very different.

 2. You're conflating API and implementation in a manner that is unclear to
me. Why are microthreads important to the Erlang model and what would the
API for fork-and-coordinate look like?

Since you are fired up about this now, would you consider writing a PEP at
least outlining the problem persuasively and championing one of the
(feasible) options? This issue has been discussed for more than a decade and
the artifacts of previous discussions can be quite hard to find.

 Paul Prescod

From krstic at solarsail.hcs.harvard.edu  Mon Sep 18 11:07:56 2006
From: krstic at solarsail.hcs.harvard.edu (Ivan Krstić)
Date: Mon, 18 Sep 2006 05:07:56 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	
	<450E34EF.3090202@solarsail.hcs.harvard.edu>
	<1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>
Message-ID: <450E61EC.40507@solarsail.hcs.harvard.edu>

Paul Prescod wrote:
> I think that there would be value in saying: "There's a
> big problem here and we intend to fix it in Python 3000."

I'm not at all convinced that this is something to be addressed in 3.0.
Py3k is about removing cruft, not adding features; a proper MP system
represents a feature addition that might be more appropriate for 2.6 or
post-3.0 if it gets horribly drawn out.

> Ivan: why don't you write a PEP about this?

I'd like to hear Guido's overarching thoughts on the matter, if any, and
would afterwards be happy to write a PEP.

> Python does not really have an abstraction as low-level as
> "memory" and I don't see why we would want to add it.

You don't need a special abstraction; a library adding primitives like
SHMlist and SHMdict would be fully adequate. Arbitrary objects could
decide to react to specific getattr/setattr calls by peeking and
poking at the primitives. A SHMpickle mechanism could be used to stuff
existing objects into SHM, and then create relevant proxies.
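
The proxies could stay quite small. A hypothetical shape, with a plain
dict standing in for the (nonexistent) SHMdict primitive:

import cPickle as pickle

class SHMProxy(object):
    # React to attribute access by peeking and poking at the primitive.
    def __init__(self, store):
        object.__setattr__(self, '_store', store)

    def __getattr__(self, name):
        return pickle.loads(self._store[name])     # peek

    def __setattr__(self, name, value):
        self._store[name] = pickle.dumps(value)    # poke

store = {}                 # would really be an SHMdict in shared memory
cfg = SHMProxy(store)
cfg.timeout = 30
print cfg.timeout          # 30, round-tripped through the 'SHM' store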

>  1. "No sharing" seems to be a feature of both of these options, but the
> wording you use to describe it is very different.

Erlang's shared-nothing is a conviction: "you don't need to share things
to get good concurrent operation". The alternative I mention is to
declare "well, we recognize the need for shared-something, but are
pointedly not providing the functionality". An irrelevant difference for
the most part.

>  2. You're conflating API and implementation in a manner that is unclear
> to me. Why are microthreads important to the Erlang model

They're not; Candygram proves as much. But the Erlang model was designed
with the idea that threads ("processes") cost almost nothing, and if
threads instead cost at least a full stack allocation, it's easy to get
into hot water.

> and what would
> the API for fork-and-coordinate look like?

I'm not going to try and design an API at 5:05AM. I'll think about this
in the next few days, and stick it in the PEP after Guido chimes in.

> Since you are fired up about this now, would you consider writing a PEP
> at least outlining the problem persuasively and championing one of the
> (feasible) options? 

Sure.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From ncoghlan at gmail.com  Mon Sep 18 12:47:08 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Mon, 18 Sep 2006 20:47:08 +1000
Subject: [Python-3000] Kill GIL?
In-Reply-To: <bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	
	<450D4AAE.2000805@gmail.com>
	<bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>
Message-ID: <450E792C.1070105@gmail.com>

Brett Cannon wrote:
> On 9/17/06, *Nick Coghlan* <ncoghlan at gmail.com 
>        - use threads to perform blocking I/O in parallel
>        - use multiple interpreters to perform Python execution in parallel
> 
> 
> Possibly, but as it stands now interpreters just execute in their own 
> Python thread, so there is no real performance boost.  Without the GIL 
> shifting over to per interpreter instead of per process there is going 
> to be the same performance problems as with Python threads.  And 
> changing that would be  hard since objects can be shared between 
> multiple interpreters.

I was thinking it would be easier to split the locking into a Global
Interpreter Lock plus a per-interpreter Local Interpreter Lock, rather than
trying to go to a full free-threading model. Anyone sharing other objects
between interpreters would
still need their own synchronisation mechanism, but something like 
threading.Queue should suffice for that.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From ncoghlan at gmail.com  Mon Sep 18 13:04:57 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Mon, 18 Sep 2006 21:04:57 +1000
Subject: [Python-3000] Kill GIL?
In-Reply-To: <1158504672.28528.82.camel@fsol>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	<450D4AAE.2000805@gmail.com>
	<1158504672.28528.82.camel@fsol>
Message-ID: <450E7D59.90706@gmail.com>

Antoine Pitrou wrote:
> On Sunday 17 September 2006 at 23:16 +1000, Nick Coghlan wrote:
>> Brett Cannon's sandboxing work (which aims to provide first-class support for 
>> multiple interpreters in the same process for security reasons) also seems 
>> like a potentially fruitful approach to distributing processing to multiple cores:
>>    - use threads to perform blocking I/O in parallel
> 
> OTOH, the Twisted approach avoids all the delicate synchronization
> issues that arise when using threads to perform concurrent IO tasks.
> 
> Also, IO is by definition not CPU-intensive, so there is no point in
> distributing IO to multiple cores (and it could even cause a small
> decrease in performance because of inter-CPU communication overhead).

Yeah, I made a mistake. The distinction is whether the CPU-intensive
task is written in C/C++ or in Python - threads already work fine for the former,
but something new is needed for the latter (either better IPC support or 
better in-process multi-interpreter support that uses an interpreter-specific 
lock for interpreter-specific data structures, reserving the GIL for shared 
state).

Cheers,
Nick.



-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From talin at acm.org  Mon Sep 18 15:38:35 2006
From: talin at acm.org (Talin)
Date: Mon, 18 Sep 2006 06:38:35 -0700
Subject: [Python-3000] Ruminations on switch, constants, imports, etc.
Message-ID: <450EA15B.1050302@acm.org>

"Ruminating" is the best word I can think of here - I've been slowly 
digesting the ideas and discussions over the last couple months. Part of 
the reason why all this is relevant to me is that I'm working on a 
couple of side projects, some of which involve "mini-languages" that 
have similar issues to what has been discussed on the list.

Bear in mind that none of what I say here is a recommendation of where 
Python *should* go - rather, it's a description of where I have 
*already* gone in some of the work that I am doing. It is merely one 
possible answer out of many to the suggestions that have been put 
forward in this forum.

(And if this is merely a rehash of something that was discussed long 
ago, I apologize.)

I'll start with the 'switch' discussion (you'll note that the ideas here 
cut across a bunch of different threads.) The controversy, for which 
there seemed to be no final resolution, had to do with when the case 
values should be evaluated (I won't repeat the descriptions of the 
various schools - look in the PEP.)

As some people pointed out, the truly fundamental issue has to do with 
the nature of constants in Python, which is to say that currently, there 
are none - other than literals.

What would be entailed in adding a const type? Without fundamentally 
changing the language, the best you can do is early binding, in other 
words the const value isn't known at compilation time, but is a variable 
that is frozen at some point - perhaps at module load time.
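
Python can already fake that kind of load-time freezing with the
default-argument trick, for whatever that's worth:

TIMEOUT = 30

def wait_time(_timeout=TIMEOUT):
    # _timeout was evaluated exactly once, when 'def' executed at
    # module load time; rebinding TIMEOUT afterwards is not seen here.
    return _timeout

TIMEOUT = 60
print wait_time()    # still 30: the value was frozen at load time

But that is a per-function idiom, not a language-level 'const'.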

Adding a true const - that is, a variable whose value is known to the 
compiler, and can be optimized as such - is somewhat more involved. For 
one thing, the compiler knows nothing about external modules, or 
anything outside the current compilation unit. Without the ability to 
import external definitions, a compile-time 'const' is quite useless.

One way around this (which is a little kludgey) would be to add a second 
type of 'import' - a compile-time import, which could be called 
something like 'include' or 'using'.

An import of this type would act as if it had been textually included in 
the current source file. It would become part of the current compilation 
unit, and it would have the same restrictions - such as the inability to 
access variables imported via 'import' at compile time.

Include files can of course include other files - but they can also 
'import' as well. The effect of importing from an include is the
same as importing from the primary source file (because of the rule
which states that 'include' is equivalent to textual inclusion.)

Conversely, imported files can include - however the effect of the 
inclusion is limited to the imported file only, and does not affect the 
primary source file (because it's a different compilation unit.)

This implies that you can't access constants that are within an imported 
module (because the constant definitions exist only within the compiler 
- they are transformed into literals before code generation occurs.)

If a source file and an included module need to share constant values, 
they must each include the definitions of those constants. This can lead 
to problems if the include files are changing - one file might be 
compiled with a different version of the include file than another. 
OTOH, many potential uses for constants would be for things like 
operating system error codes and flags, which are fairly stable and 
unchanging -- so even a restricted use of the facility (i.e. don't use 
it for values which are in flux) might be worth while. Another 
possibility is to embed include checksums or other version info within 
the compiled file.

So far, it seems like a lot of added complexity for fairly little 
benefit. However, where it gets interesting is when you realize that 
once you've given the compiler knowledge of the world outside a single 
compilation unit, a number of interesting possibilities arise.

The one I've been experimenting with in my mini-language is macros. Not 
macros in the C sense, but in the lisp sense - a function which takes 
unevaluated arguments. Actually, they more closely resemble Dylan 
macros, in that they add production rules to the parser. Internally a 
macro is a small snippet of an AST, which gets spliced into the AST of 
the calling function at compile time. Macro arguments can be 
identifiers, expressions, and statements, all of which get substituted 
into the appropriate point in the AST.
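
In today's CPython one can at least approximate the flavor of this
with the ast helpers (available in later releases): parse a template,
substitute a placeholder Name node, and compile the spliced result.
A sketch:

import ast

template = ast.parse("RESULT = A + A")       # the macro body, as an AST

class Splice(ast.NodeTransformer):
    # Replace the placeholder identifier 'A' with a literal argument.
    def visit_Name(self, node):
        if node.id == 'A':
            return ast.copy_location(ast.Num(21), node)
        return node

tree = Splice().visit(template)
ast.fix_missing_locations(tree)
ns = {}
exec compile(tree, '<macro>', 'exec') in ns
print ns['RESULT']                           # 42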

This is of course going way past Guido's "Programmable Syntax" 
prohibition. Good thing I am only talking hypothetically!

It seems to me, however, (getting back to 'switch') that supporting a 
proper 'switch' statement has to address these issues in *some* fashion 
- the issue of constants isn't going to go away completely under any of 
the proposed approaches.

In fact, I'm actually leaning towards the position of *not* adding a 
switch statement to Python, simply because I'm not sure that Python 
*should* deal with all of these issues. It seems to me that adding 
'const' to the language opens up a Pandora's box, containing both chaos 
and hope - and I think that for some language X, it may be a good idea 
to open that box, but I don't know if X includes Python.

-- Talin


From rhamph at gmail.com  Mon Sep 18 17:48:56 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Mon, 18 Sep 2006 09:48:56 -0600
Subject: [Python-3000] Delayed reference counting idea
Message-ID: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>

I think all the attempts to expose GIL-less semantics to python code
miss the point.  Reference counting turns all references into
modifications.  You can't avoid the GIL without first changing
reference counting.

There are a few ways to approach this:
* atomic INCREF/DECREF using CPU instructions.  This would be very
expensive, considering how often we do it.
* Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally unsupported
by the C standards and changes cache characteristics that CPython has
been designed with for years, likely with a very large performance
penalty.
* Tracing GC within C.  Would require rewriting every API in CPython,
as well as the code that uses them.  Alternative implementations
(PyPy, et al) can try this, but I think it's clear that it's not worth
the effort for CPython, especially given the performance risks.
* Delayed reference counting (save 10 or 20 INCREF/DECREF ops to a
buffer, then flush them all at once).  In theory, it would retain the
cache locality while amortizing locking needed for SMP machines.

For the most part delayed reference counting should require no
changes, since it would use the existing INCREF/DECREF API.  Some code
does circumvent that API, and would need to be changed.
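
In Python-as-pseudocode, the buffering amounts to something like this
(one buffer per thread is assumed; the real thing would of course be
C, keyed off the object address):

from __future__ import with_statement
import threading

class DelayedRefcounts(object):
    FLUSH_AT = 20

    def __init__(self):
        self.counts = {}             # id(obj) -> shared reference count
        self.pending = []            # buffered (delta, id(obj)) ops
        self.lock = threading.Lock() # taken only when a batch flushes

    def incref(self, obj):
        self._buffer(+1, obj)

    def decref(self, obj):
        self._buffer(-1, obj)

    def _buffer(self, delta, obj):
        self.pending.append((delta, id(obj)))
        if len(self.pending) >= self.FLUSH_AT:
            self.flush()

    def flush(self):
        # One lock acquisition amortized over FLUSH_AT operations.
        # Summing deltas first means a matched +1/-1 pair never
        # touches the shared counts at all.
        totals = {}
        for delta, key in self.pending:
            totals[key] = totals.get(key, 0) + delta
        del self.pending[:]
        with self.lock:
            for key, delta in totals.items():
                if delta:
                    self.counts[key] = self.counts.get(key, 0) + delta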

Anyway, my point is that, for those of you out there who want to
remove the GIL, here is something you really can experiment with.
Even if there were a 20% performance drop on real-world tests, you could
still make it a configure option, enabled only for people who need
many CPUs.  (I've tried it myself, but never got past the weird
crashes; I probably missed something silly.)

-- 
Adam Olsen, aka Rhamphoryncus

From jimjjewett at gmail.com  Mon Sep 18 17:56:51 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 18 Sep 2006 11:56:51 -0400
Subject: [Python-3000] Kill GIL? - to PEP 3099?
Message-ID: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>

Guido -- If I'm not mis-stating, this might be a candidate for PEP 3099.

On 9/18/06, Ivan Krstić <krstic at solarsail.hcs.harvard.edu> wrote:
> Paul Prescod wrote:

> > Ivan: why don't you write a PEP about this?

> I'd like to hear Guido's overarching thoughts on the matter, if any, and
> would afterwards be happy to write a PEP.

IIRC, his most recent statements boiled down to:

(1)  The GIL works well enough, most of the time.
(2)  Taking it out is harder than people realize.
(3)  Therefore, he won't spend too much time rethinking unless/until
there is code to evaluate.

-jJ

From phd at phd.pp.ru  Mon Sep 18 18:02:33 2006
From: phd at phd.pp.ru (Oleg Broytmann)
Date: Mon, 18 Sep 2006 20:02:33 +0400
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
References: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
Message-ID: <20060918160232.GB30336@phd.pp.ru>

On Mon, Sep 18, 2006 at 11:56:51AM -0400, Jim Jewett wrote:
> IIRC, his most recent statements boiled down to:
> 
> (1)  The GIL works well enough, most of the time.

1a. On multiprocessor/multicore systems use processes, not threads.

> (2)  Taking it out is harder than people realize.
> (3)  Therefore, he won't spend too much time rethinking unless/until
> there is code to evaluate.

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.

From qrczak at knm.org.pl  Mon Sep 18 18:27:12 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Mon, 18 Sep 2006 18:27:12 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	(Adam Olsen's message of "Mon, 18 Sep 2006 09:48:56 -0600")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
Message-ID: <8764fl87j3.fsf@qrnik.zagroda>

"Adam Olsen" <rhamph at gmail.com> writes:

> * Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally unsupported
> by the C standards and changes cache characteristics that CPython has
> been designed with for years, likely with a very large performance
> penalty.

Last time I did some GC benchmarks (unrelated to Python), Boehm GC
came up surprisingly fast. I suppose it's faster than malloc +
reference counting (not sure how much amortizing malloc calls helps).

I don't like the idea of a conservative GC at all in general, but
Boehm GC seems to have very good quality, and it's easy to use from
the point of view of a C API.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From paul at prescod.net  Mon Sep 18 18:45:52 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 18 Sep 2006 09:45:52 -0700
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
References: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
Message-ID: <1cb725390609180945r64f2cfafv33a801abe42452b4@mail.gmail.com>

The thread subject notwithstanding, the majority of the discussion was about
ways to work around the GIL, not remove it. Therefore the thing you might
put in PEP 3099 is not the thing under active discussion.

On 9/18/06, Jim Jewett <jimjjewett at gmail.com> wrote:
>
> Guido -- If I'm not mis-stating, this might be a candidate for PEP 3099.
>
> On 9/18/06, Ivan Krstić <krstic at solarsail.hcs.harvard.edu> wrote:
> > Paul Prescod wrote:
>
> > > Ivan: why don't you write a PEP about this?
>
> > I'd like to hear Guido's overarching thoughts on the matter, if any, and
> > would afterwards be happy to write a PEP.
>
> IIRC, his most recent statements boiled down to:
>
> (1)  The GIL works well enough, most of the time.
> (2)  Taking it out is harder than people realize.
> (3)  Therefore, he won't spend too much time rethinking unless/until
> there is code to evaluate.
>
> -jJ
>

From rhamph at gmail.com  Mon Sep 18 19:11:51 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Mon, 18 Sep 2006 11:11:51 -0600
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <8764fl87j3.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
Message-ID: <aac2c7cb0609181011x28c1b639weef05f356f75d9b6@mail.gmail.com>

On 9/18/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> "Adam Olsen" <rhamph at gmail.com> writes:
>
> > * Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally unsupported
> > by the C standards and changes cache characteristics that CPython has
> > been designed with for years, likely with a very large performance
> > penalty.
>
> Last time I did some GC benchmarks (unrelated to Python), Boehm GC
> came up surprisingly fast. I suppose it's faster than malloc +
> reference counting (not sure how much amortizing malloc calls helps).

I expect Boehm would do very well in applications suited for it.  I
just don't think that includes CPython, especially with all the
third-party C libraries.


-- 
Adam Olsen, aka Rhamphoryncus

From barry at python.org  Mon Sep 18 19:40:19 2006
From: barry at python.org (Barry Warsaw)
Date: Mon, 18 Sep 2006 13:40:19 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <8764fl87j3.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
Message-ID: <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>

On Sep 18, 2006, at 12:27 PM, Marcin 'Qrczak' Kowalczyk wrote:

> "Adam Olsen" <rhamph at gmail.com> writes:
>
>> * Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally  
>> unsupported
>> by the C standards and changes cache characteristics that CPython has
>> been designed with for years, likely with a very large performance
>> penalty.
>
> Last time I did some GC benchmarks (unrelated to Python), Boehm GC
> came up surprisingly fast. I suppose it's faster than malloc +
> reference counting (not sure how much amortizing malloc calls helps).
>
> I don't like the idea of a conservative GC at all in general, but
> Boehm GC seems to have very good quality, and it's easy to use from
> the point of view of a C API.

What worries me is the unpredictability of gc vs. refcounting.  For  
some class of Python applications it's important that when an object's
last reference is dropped it really goes away right then.  I /like/ reference
counting!

-Barry


From solipsis at pitrou.net  Mon Sep 18 20:38:00 2006
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Mon, 18 Sep 2006 20:38:00 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
Message-ID: <1158604680.4726.9.camel@fsol>


On Monday 18 September 2006 at 09:48 -0600, Adam Olsen wrote:
> * Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally unsupported
> by the C standards and changes cache characteristics that CPython has
> been designed with for years, likely with a very large performance
> penalty.

Has anyone measured what cache effects reference counting entails?

With reference counting, each object is mutable from the point of view
of the CPU cache (refcnt is always incremented and later decremented).
This means almost every cache line containing Python objects - including
functions, modules... - has to be written back when it is evicted, even
if those objects are "constant".

> * Delayed reference counting (save 10 or 20 INCREF/DECREF ops to a
> buffer, then flush them all at once).  In theory, it would retain the
> cache locality while amortizing locking needed for SMP machines.

You would have to lock the buffer, wouldn't you?
Unless you use per-CPU buffers.




From meyer at acm.org  Mon Sep 18 21:18:29 2006
From: meyer at acm.org (Andre Meyer)
Date: Mon, 18 Sep 2006 21:18:29 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E4059.1000806@v.loewis.de>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450E4059.1000806@v.loewis.de>
Message-ID: <7008329d0609181218v64ca1465s14cefe3cc91b67a8@mail.gmail.com>

I would love to contribute code for this problem. Unfortunately, I am not
able to do so, but see the problem for myself and others. Therefore, I
wanted to raise the question in time for Py3k. The number of responses
indicates that it is not just me who struggles and there are people who know
how to improve the situation. So far, this seems like a fruitful discussion.

Thanks
Andre


On 9/18/06, "Martin v. L?wis" <martin at v.loewis.de> wrote:
>
> Andre Meyer schrieb:
> > While I understand the difficulties in removing the GIL and the
> > potential negative effect on single-threaded applications I would very
> > much encourage discussion to seriously consider removing the GIL (maybe
> > optionally) in Py3k. If not, what alternatives would you suggest?
>
> Encouraging "very much" is probably not good enough to make anything
> happen. Actual code contributions may, as may offering a bounty
> (although it probably depends on the size of the bounty whether anybody
> wants to collect it).
>
> The alternatives are very straight-forward:
> 1. use Python the same way as you did for Python 2.x. I.e. create
>    many threads, and have only one of them run. Use the other processors
>    for something else, or don't use them at all.
> 2. use Python the same way as many other people do. Don't use threads,
>    instead use multiple processors, and some sort of IPC.
> 3. don't use Python, at least not for the activities that need to
>    run on multiple processors.
> If you want to fully use your multiple processors, depending on the
> application, I'd typically go with option 2 or 3. Option 2 if the code
> to parallelize is written in Python, option 3 if it is written in C
> (yes, you can use multiple truly concurrent threads in Python: just
> release the GIL on the C level; you can't make any calls into Python
> until you reacquire the GIL).
>
> Regards,
> Martin
>



-- 
Dr. Andre P. Meyer                        http://python.openspace.nl/meyer
TNO Defence, Security and Safety          http://www.tno.nl/
Delft Cooperation on Intelligent Systems  http://www.decis.nl/

Ah, this is obviously some strange usage of the word 'safe' that I wasn't
previously aware of. - Douglas Adams
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/python-3000/attachments/20060918/32b0dfae/attachment.htm 

From jimjjewett at gmail.com  Mon Sep 18 21:27:02 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 18 Sep 2006 15:27:02 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <1158604680.4726.9.camel@fsol>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<1158604680.4726.9.camel@fsol>
Message-ID: <fb6fbf560609181227w2066144r7c60f436ad48ff0a@mail.gmail.com>

On 9/18/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
> On Monday 18 September 2006 at 09:48 -0600, Adam Olsen wrote:
> > * Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally unsupported
> > by the C standards and changes cache characteristics that CPython has
> > been designed with for years, likely with a very large performance
> > penalty.

> Has anyone measured what cache effects reference counting entails?

Probably not recently.

> With reference counting, each object is mutable from the point of view
> of the CPU cache (refcnt is always incremented and later decremented).

But each object access touches only one piece of memory, not two (the
object and a separate header).

Just a reminder about Neil Schemenauer's (old) patch to use Boehm-Demers

http://mail.python.org/pipermail/python-list/1999-July/thread.html#7638
http://arctrix.com/nas/python/gc/
http://people.csail.mit.edu/gregs/ll1-discuss-archive-html/threads.html#00056

According to http://codespeak.net/pypy/dist/pypy/doc/getting-started.html
PyPy sometimes translates to the use of BDW.

I also seem to remember (but can't find a reference) that someone
tried using a separate immortal namespace for basic objects like None,
but the hassle of deciding what to do on each object ate up the
savings.

-jJ

From rhettinger at ewtllc.com  Mon Sep 18 21:56:27 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Mon, 18 Sep 2006 12:56:27 -0700
Subject: [Python-3000] Delayed reference counting idea
Message-ID: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>

[Marcin 'Qrczak' Kowalczyk]
> I don't like the idea of a conservative GC at all in general, but
> Boehm GC seems to have very good quality, and it's easy to use from
> the point of view of a C API.

Several thoughts:

* An easier C API would significantly benefit the language in terms of
more extensions being available and in terms of increased reliability
for those extensions.  The current refcount scheme results in pervasive
refleak bugs and subsequent, interminable bughunts.  It adds to code
verbosity/complexity and makes it tricky for beginning extension writers
to get their first apps done correctly.  IOW, I agree that GC without
refcounts will make it easier to write good C code.

* I doubt the anecdotal comments about Boehm GC with respect to
performance.  It may be better or it may be worse.  While I think the
latter is more likely, only an implementation patch will tell the tale.

* At my company, we write real-time apps that benefit from the current
refcounting scheme.  We would have to stick with Py2.x unless Boehm GC
can be implemented without periodically killing responsiveness.



[Barry Warsaw]
> What worries me is the unpredictability of gc vs. refcounting.
> For some class of Python applications it's important that when
> an object is dereferenced it really goes away right then.  
> I /like/ reference counting!

No doubt that those exist; however, that sort of design is somewhat
fragile and bug-prone, leading to endless sessions to find out who or what
is keeping an object alive.  This situation can only get worse when
new-style classes become the norm.  Also, IIRC, __del__ has been one of
the more complex, bug-ridden, and dark corners of the
language.  Statistics incontrovertibly prove that people who habitually
avoid __del__ lead happier lives and spend fewer hours in therapy ;-)


Raymond 

From rhamph at gmail.com  Mon Sep 18 21:59:35 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Mon, 18 Sep 2006 13:59:35 -0600
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <1158604680.4726.9.camel@fsol>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<1158604680.4726.9.camel@fsol>
Message-ID: <aac2c7cb0609181259s21468f0ev88a8f78815340264@mail.gmail.com>

On 9/18/06, Antoine Pitrou <solipsis at pitrou.net> wrote:
>
> On Monday 18 September 2006 at 09:48 -0600, Adam Olsen wrote:
> > * Bolt-on tracing GC such as Boehm-Demers-Weiser.  Totally unsupported
> > by the C standards and changes cache characteristics that CPython has
> > been designed with for years, likely with a very large performance
> > penalty.
>
> Has anyone measured what cache effects reference counting entails?
>
> With reference counting, each object is mutable from the point of view
> of the CPU cache (refcnt is always incremented and later decremented).
> This means almost every cache line containing Python objects - including
> functions, modules... - has to be written back when it is evicted, even
> if those objects are "constant".

I don't think there's ever been any measuring, just theorizing based
on some general benchmarks.  For example, the cache line containing the
refcount is likely already loaded when the type pointer is loaded.

However, delayed reference counting could allow you to remove
incref/decref pairs, thereby avoiding the write entirely in some
cases.


>
> > * Delayed reference counting (save 10 or 20 INCREF/DECREF ops to a
> > buffer, then flush them all at once).  In theory, it would retain the
> > cache locality while amortizing locking needed for SMP machines.
>
> You would have to lock the buffer, wouldn't you?
> Unless you use per-CPU buffers.

I'm assuming per-CPU buffers.  You'd need a global lock to flush them.
There are probably more creative schemes, but they couldn't be
implemented quite so simply.


-- 
Adam Olsen, aka Rhamphoryncus

From barry at python.org  Mon Sep 18 22:15:47 2006
From: barry at python.org (Barry Warsaw)
Date: Mon, 18 Sep 2006 16:15:47 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
Message-ID: <8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>

On Sep 18, 2006, at 3:56 PM, Raymond Hettinger wrote:

> * An easier C API would significantly benefit the language in terms of
> more extensions being available and in terms of increased reliability
> for those extensions.  The current refcount scheme results in pervasive
> refleak bugs and subsequent, interminable bughunts.  It adds to code
> verbosity/complexity and makes it tricky for beginning extension writers
> to get their first apps done correctly.  IOW, I agree that GC without
> refcounts will make it easier to write good C code.
>
> * I doubt the anecdotal comments about Boehm GC with respect to
> performance.  It may be better or it may be worse.  While I think the
> latter is more likely, only an implementation patch will tell the tale.
>
> * At my company, we write real-time apps that benefit from the current
> refcounting scheme.  We would have to stick with Py2.x unless Boehm GC
> can be implemented without periodically killing responsiveness.

We'd be in the same boat.  While I agree with Raymond that it can be  
quite difficult to get C code to be refcount-correct, I wonder if  
there aren't tools or other debugging aids we can develop that will  
at least help debug when problems occur.  Not that I have any bright  
ideas here, but as an example, one of the things we do when our app  
exits (it's potentially long running, but never daemonic) is to  
stroll through the list of all live objects, checking their refcounts  
against expected values.  Of course we only do this in debug builds,  
but right now in our dev tree I'm looking at an issue where a central  
object has a few hundred more refcounts than expected at program exit.

The really tricky thing about refcounting is making sure all the exit  
conditions out of a function are refcount correct.  Usually these  
involve error or exception conditions, and they can be a bear to get  
right.  Makes you want to write the goto-considered-useful rant all
over again. :)  Would a garbage collection interface make this easier
(because you could ignore all that) or would you be trading that off
for things like gcpro in Emacs, which can be just as harmful if you
screw them up?
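
For the archives, the usual defensive shape is a single exit label,
something like this (a toy example against the real C API, not code
from our tree):

    static PyObject *
    add_and_pack(PyObject *a, PyObject *b)
    {
        PyObject *sum = NULL;
        PyObject *result = NULL;

        sum = PyNumber_Add(a, b);
        if (sum == NULL)
            goto done;              /* error exit: nothing owned yet */
        result = PyTuple_Pack(2, a, sum);
        /* fall through: sum must be released on success *and* failure */
    done:
        Py_XDECREF(sum);
        return result;
    }

It's manageable at this size; the bughunts start when a function has
ten exits and six owned references.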

-Barry

From jimjjewett at gmail.com  Mon Sep 18 22:33:09 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 18 Sep 2006 16:33:09 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>
Message-ID: <fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>

On 9/18/06, Barry Warsaw <barry at python.org> wrote:

>  ... I agree with Raymond that it can be quite difficult to get
> C code to be refcount-correct, ...

How much of this (particularly for beginners) is remembering the
refcount effects of standard functions?  Could this be avoided by just
always using the more abstract interface?  (Sequence instead of List,
Mapping instead of Dict)

> The really tricky thing about refcounting is making sure all the exit
> conditions out of a function are refcount correct.  Usually these
> involve error or exception conditions, and they can be a bear to get
> right.

Would it solve this problem if there were a PyTEMPREF that magically
treated the refcount as an automatic variable?  (It increfed
immediately, and decrefed whenever the function exited, without the
user having to track this manually.)

Would it help enough to justify a pre-processing requirement?

-jJ

From rhamph at gmail.com  Mon Sep 18 22:34:44 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Mon, 18 Sep 2006 14:34:44 -0600
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
Message-ID: <aac2c7cb0609181334n36758748t9cfb4864bf7d9611@mail.gmail.com>

On 9/18/06, Raymond Hettinger <rhettinger at ewtllc.com> wrote:
> [Adam Olsen]
> > I don't like the idea of a conservative GC at all in general, but
> > Boehm GC seems to have very good quality, and it's easy to use from
> > the point of view of a C API.

This was Marcin, not me ;)


> Several thoughts:
>
> * An easier C API would significantly benefit the language in terms of
> more extensions being available and in terms of increased reliability
> for those extensions.  The current refcount scheme results in pervasive
> refleak bugs and subsequent, interminable bughunts.  It adds to code
> verbosity/complexity and makes it tricky for beginning extension writers
> to get their first apps done correctly.  IOW, I agree that GC without
> refcounts will make it easier to write good C code.
>
> * I doubt the anecdotal comments about Boehm GC with respect to
> performance.  It may be better or it may be worse.  While I think the
> latter is more likely, only an implementation patch will tell the tale.

I have played with it before, on the CPython codebase.  I really can't
imagine it getting more than a minor speed boost, or else we'd already
be finding that refcounting was taking up a large portion of our CPU
time.  (Anybody have actual numbers on the time spent in malloc/free?)

The real advantage of Boehm is with threading.  Avoiding the locking
means you don't get the giant penalty you'd otherwise get.  Still not
inherently faster than a single-threaded program (which needs no
locking).

I discount Boehm because of its complexity and non-standardness,
though.  I'd never want to maintain it, especially since it would
affect all the libraries we link to as well.  Although, with suitable
proxying, it may be possible to limit it to just Python objects...

If I were to seriously consider a Python implementation with a tracing
GC, I'd want it to be a moving GC, to fix the high-water mark problem
of malloc.  That seems incompatible with conservative GCs such as
Boehm, although, come to think of it, I could do it using
standard-conforming C (if any API rewrite were permissible).


> * At my company, we write real-time apps that benefit from the current
> refcounting scheme.  We would have to stick with Py2.x unless Boehm GC
> can be implemented without periodically killing responsiveness.

Boehm does have options for incremental GC.


> [Barry Warsaw]
> > What worries me is the unpredictability of gc vs. refcounting.
> > For some class of Python applications it's important that when
> > an object is dereferenced it really goes away right then.
> > I /like/ reference counting!
>
> No doubt that those exist; however, that sort of design is somewhat
> fragile and bug-prone, leading to endless sessions to find out who or what
> is keeping an object alive.  This situation can only get worse when
> new-style classes become the norm.  Also, IIRC, bugs involving __del__
> have been one of the more complex, buggy, and dark corners of the
> language.  Statistics incontrovertibly prove that people who habitually
> avoid __del__ lead happier lives and spend fewer hours in therapy ;-)

I agree here.  I think an executor approach is much better; kill the
object, then make a weakref callback do any further cleanups using
copies it made in advance.


-- 
Adam Olsen, aka Rhamphoryncus

From ronaldoussoren at mac.com  Mon Sep 18 22:45:35 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Mon, 18 Sep 2006 22:45:35 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
Message-ID: <BF6C3E46-E388-45F9-8E4D-69C10979CEC0@mac.com>


On Sep 18, 2006, at 9:56 PM, Raymond Hettinger wrote:

>
> * I doubt the anecdotal comments about Boehm GC with respect to
> performance.  It may be better or it may be worse.  While I think the
> latter is more likely, only an implementation patch will tell the tale.

hear, hear ;-). Other anecdotal evidence says that a GC can be
significantly faster than manual allocation, especially a copying
collector where allocation can be really, really cheap. Boehm's GC
isn't a copying collector, but I wouldn't count it out just because
"everybody knows that GC is slow".

I'd be more worried about changes in semantics. It's pretty convenient
to write 'open(somefile, 'r').read()' to read a file in bulk; currently
this immediately closes the file, but with a GC system it may be a long
time before the file is actually closed.

Another reason to be scared of GC is some bad experience I've had
with Java's GC: it's rather annoying if you're a sysadmin, get a Java
app thrown over the wall, and then have to tweak obscure GC-related
parameters to get decent performance (or rather, an application that
doesn't crash after running for a couple of days). That may have been  
bad code in the application, but I'm not entirely convinced that  
Java's GC doesn't deserve to get some of the blame.

Ronald

From barry at python.org  Mon Sep 18 22:56:19 2006
From: barry at python.org (Barry Warsaw)
Date: Mon, 18 Sep 2006 16:56:19 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>
	<fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>
Message-ID: <F53B4026-089C-4EFA-9C34-68A161D612DF@python.org>

On Sep 18, 2006, at 4:33 PM, Jim Jewett wrote:

> On 9/18/06, Barry Warsaw <barry at python.org> wrote:
>
>>  ... I agree with Raymond that it can be quite difficult to get
>> C code to be refcount-correct, ...
>
> How much of this (particularly for beginners) is remembering the
> refcount affects of standard functions?  Could this be avoided by just
> always using the more abstract interface?  (Sequence instead of List,
> Mapping instead of Dict)

I think that may be part of it (I've mentioned this before), but our  
C API code wasn't written by beginners, and while we don't have any  
known refcounting problems in production code, during development one  
or two can slip through.  I don't think that the above is the major  
contributor.

>> The really tricky thing about refcounting is making sure all the exit
>> conditions out of a function are refcount correct.  Usually these
>> involve error or exception conditions, and they can be a bear to get
>> right.
>
> Would it solve this problem if there were a PyTEMPREF that magically
> treated the refcount as an automatic variable?  (It increfed
> immediately, and decrefed whenever the function exited, without the
> user having to track this manually.)
>
> Would it help enough to justify a pre-processing requirement?

I don't know, I hate macros. :)

<talking from="my ass">
It's been a long while since I programmed on the NeXT, so Mac folks  
here please chime in, but isn't there some Foundation idiom where  
temporary Objective-C objects didn't need to be explicitly released  
if their lifetime was exactly the duration of the function in which  
they were created?  ISTR something like the main event loop tracking  
such refcount=1 objects and deleting them automatically the next time  
through the loop.  Since Python has a main loop, I wonder if the same  
kind of trick couldn't be done here.

IOW, if you're just creating an object temporarily, you never need to  
explicitly decref it because the main eval loop would do it for you.   
Dang I wish I could remember the details.
</talking>

Something like that, where you didn't have to track all objects  
through all exit conditions would probably help.
-Barry

From martin at v.loewis.de  Mon Sep 18 23:11:03 2006
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Mon, 18 Sep 2006 23:11:03 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <1158604680.4726.9.camel@fsol>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<1158604680.4726.9.camel@fsol>
Message-ID: <450F0B67.2030405@v.loewis.de>

Antoine Pitrou wrote:
> Has anyone measured what cache effects reference counting entails?

I don't think so.

> With reference counting, each object is mutable from the point of view
> of the CPU cache (refcnt is always incremented and later decremented).
> This means almost every cache line containing Python objects - including
> functions, modules... - has to be written back when it is evicted, even
> if those objects are "constant".

Yes, though this is likely negligible compared to the overhead that
locking operations on refcount changes would have.

Regards,
Martin

From martin at v.loewis.de  Mon Sep 18 23:16:01 2006
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Mon, 18 Sep 2006 23:16:01 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
Message-ID: <450F0C91.3090608@v.loewis.de>

Raymond Hettinger wrote:
> * An easier C API would significantly benefit the language in terms of
> more extensions being available and in terms of increased reliability
> for those extensions.  The current refcount scheme results in pervasive
> refleak bugs and subsequent, interminable bughunts.  It adds to code
> verbosity/complexity and makes it tricky for beginning extension writers
> to get their first apps done correctly.  IOW, I agree that GC without
> refcounts will make it easier to write good C code.

I don't think this will be the case. A garbage collector will likely
need to find out which local and global variables are pointers,
as well as the pointers hidden in C structures (at least if the
collector is going to be "precise").

So I think a Python with "true" GC will be much more error-prone
on the C level, with authors not getting the declarations of
variables right, and endless bug hunts because a referenced object
is already collected, and its memory overwritten.
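
To illustrate the burden, an extension function might end up looking
something like this, where GC_PROTECT2/GC_UNPROTECT are invented
macros standing in for whatever root registration a precise collector
would demand (compare Emacs's gcpro; error checks elided):

    static PyObject *
    build_list(void)
    {
        PyObject *list, *item;
        GC_PROTECT2(list, item);   /* register locals as GC roots */

        list = PyList_New(0);
        item = PyInt_FromLong(42);
        PyList_Append(list, item);

        GC_UNPROTECT();            /* forget this, and the collector may
                                      free or move objects this frame
                                      still uses */
        return list;
    }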

Regards,
Martin

From martin at v.loewis.de  Mon Sep 18 23:19:05 2006
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Mon, 18 Sep 2006 23:19:05 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <BF6C3E46-E388-45F9-8E4D-69C10979CEC0@mac.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<BF6C3E46-E388-45F9-8E4D-69C10979CEC0@mac.com>
Message-ID: <450F0D49.4090606@v.loewis.de>

Ronald Oussoren wrote:
> hear, hear ;-). Other anecdotal evidence says that a GC can be
> significantly faster than manual allocation, especially a copying
> collector where allocation can be really, really cheap.

OTOH, it isn't typically faster than obmalloc (which also allocates
in constant time "on average").

Regards,
Martin

From rhettinger at ewtllc.com  Tue Sep 19 03:21:44 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Mon, 18 Sep 2006 18:21:44 -0700
Subject: [Python-3000] Delayed reference counting idea
Message-ID: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4696@hammer.office.bhtrader.com>

[Raymond Hettinger]
>> * At my company, we write real-time apps that benefit from the current
>> refcounting scheme.  We would have to stick with Py2.x unless Boehm GC
>> can be implemented without periodically killing responsiveness.

[Jim Jewett]
> Do you effectively turn off cyclic collection (because refcounting
> reclaims enough), or is the current cyclic collector fast enough?

We turn off GC and code carefully.



Raymond

From greg.ewing at canterbury.ac.nz  Tue Sep 19 06:01:10 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 16:01:10 +1200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E792C.1070105@gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450D4AAE.2000805@gmail.com>
	<bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>
	<450E792C.1070105@gmail.com>
Message-ID: <450F6B86.5050902@canterbury.ac.nz>

Nick Coghlan wrote:

> I was thinking it would be easier to split out the Global Interpreter Lock and 
> a per-interpreter Local Interpreter Lock, rather than trying to go to a full 
> free-threading model. Anyone sharing other objects between interpreters would 
> still need their own synchronisation mechanism, but something like 
> threading.Queue should suffice for that.

I don't think that using an ordinary Queue object would
suffice for that, because it's designed on the assumption
that basic refcounting etc. is already protected by a GIL.

If nothing else, you'd need some kind of extra locking
mechanism to manage the refcount of the Queue object itself.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 19 06:52:22 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 16:52:22 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
Message-ID: <450F7786.3060501@canterbury.ac.nz>

Adam Olsen wrote:
> Reference counting turns all references into
> modifications.
> 
> There's a few ways to approach this:

I've just thought of another one: Instead of a single
refcount per object, each thread has its own set of
refcounts. Then the object has a count of the number
of threads that currently have nonzero refcounts for
it.

Most refcount operations would only affect the
thread's local refcount for the object. Only when that
reached zero would you need to lock the object and
update the global refcount.

Not sure what kind of data structure you'd use for
this, though...
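
One possible shape for that data, with the hard parts waved away
behind invented helper names:

    typedef struct {
        PyObject  *obj;       /* key: the object's address */
        Py_ssize_t count;     /* this thread's private refcount */
    } local_ref;              /* entry in a per-thread hash table */

    static void
    local_incref(PyObject *o)
    {
        local_ref *r = lookup_or_insert(current_thread_table(), o);
        if (r->count++ == 0)
            lock_and_inc_owners(o);   /* first reference in this thread */
    }

    static void
    local_decref(PyObject *o)
    {
        local_ref *r = lookup_or_insert(current_thread_table(), o);
        if (--r->count == 0) {
            /* last reference in this thread: lock the object, drop its
               count of owning threads, reclaim when that hits zero */
            if (lock_and_dec_owners(o) == 0)
                _Py_Dealloc(o);
        }
    }

The lookup is the worrying part: a hash probe on every incref could
easily cost more than the single write it replaces.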

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 19 07:20:00 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 17:20:00 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
Message-ID: <450F7E00.7000304@canterbury.ac.nz>

Raymond Hettinger wrote:

> * An easier C API would significantly benefit the language in terms of
> more extensions being available and in terms of increased reliability
> for those extensions.  The current refcount scheme results in pervasive
> refleak bugs and subsequent, interminable bughunts.

It's not clear that a different scheme would be much
different, though. If it's not refcounting, there will
be some other set of rules that must be followed, with
equally obscure bugs if you slip up.

Also, at least half of the boilerplate is due to the
necessity of checking for errors at each step. A
different GC scheme wouldn't help with that.

IMO the only way to make writing C extensions truly
straightforward and non-error-prone is to use some
kind of code generation tool like Pyrex. And then
it doesn't matter how complicated the rules are.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 19 07:27:18 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 17:27:18 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>
	<fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>
Message-ID: <450F7FB6.90709@canterbury.ac.nz>

Jim Jewett wrote:

> Would it solve this problem if there were a PyTEMPREF that magically
> treated the refcount as an automatic variable?  (It increfed
> immediately, and decrefed whenever the function exited, without the
> user having to track this manually.)

This would be wrong, because most functions return
new references, which should *not* be increfed when
assigned to a variable.

How would you implement that in C anyway? (C++ could
do it, but we're not going there, as far as I know.)

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 19 07:31:43 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 17:31:43 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <aac2c7cb0609181334n36758748t9cfb4864bf7d9611@mail.gmail.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<aac2c7cb0609181334n36758748t9cfb4864bf7d9611@mail.gmail.com>
Message-ID: <450F80BF.5050203@canterbury.ac.nz>

We seem to have a situation where we have refcounting,
which incurs a small penalty many times, but which we're
willing to pay for the benefits it brings, and locking,
which in theory should also have only a small penalty
most of the time, not much bigger than refcounting.
But it seems we're not willing to pay both of these
penalties at once.

I'm wondering whether there's some way we could merge
the two -- i.e. somehow make the one mechanism serve
as both a refcounting *and* a locking mechanism at
the same time. A refcount is a count, and a semaphore
also has a count... is there some way we can make
use of that?

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From greg.ewing at canterbury.ac.nz  Tue Sep 19 07:36:26 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 17:36:26 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <BF6C3E46-E388-45F9-8E4D-69C10979CEC0@mac.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<BF6C3E46-E388-45F9-8E4D-69C10979CEC0@mac.com>
Message-ID: <450F81DA.6010204@canterbury.ac.nz>

Ronald Oussoren wrote:

> I'd be more worried about changes in semantics. It's pretty convenient
> to write 'open(somefile, 'r').read()' to read a file in bulk; currently
> this immediately closes the file, but with a GC system it may be a long
> time before the file is actually closed.

Another data point in favour of deterministic memory
management: I was working on a game recently involving
OpenGL and animation, and I found that I couldn't get
a smooth frame rate until I disabled cyclic GC, after
which everything was fine.

So I'd be unhappy if refcounting were removed and not
replaced with something equally unobtrusive in the case
where you don't create a lot of cycles.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From ronaldoussoren at mac.com  Tue Sep 19 07:46:42 2006
From: ronaldoussoren at mac.com (Ronald Oussoren)
Date: Tue, 19 Sep 2006 07:46:42 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <F53B4026-089C-4EFA-9C34-68A161D612DF@python.org>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>
	<fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>
	<F53B4026-089C-4EFA-9C34-68A161D612DF@python.org>
Message-ID: <12CCD752-F18B-4DE0-91FC-D490C2E85421@mac.com>


On Sep 18, 2006, at 10:56 PM, Barry Warsaw wrote:

>
> I don't know, I hate macros. :)
>
> <talking from="my ass">
> It's been a long while since I programmed on the NeXT, so Mac folks
> here please chime in, but isn't there some Foundation idiom where
> temporary Objective-C objects didn't need to be explicitly released
> if their lifetime was exactly the duration of the function in which
> they were created?  ISTR something like the main event loop tracking
> such refcount=1 objects and deleting them automatically the next time
> through the loop.  Since Python has a main loop, I wonder if the same
> kind of trick couldn't be done here.

Objective-C, or rather Cocoa, uses reference counting but with a
twist. Cocoa has autorelease pools (class NSAutoreleasePool); any
object that is inserted into an autorelease pool gets its refcount
decreased when the pool is deleted. Furthermore, the main event loop
creates a new pool at the start of the loop and removes it at the
end, cleaning up all autoreleased objects.

Most Cocoa methods return borrowed references (which they can do
because of autorelease pools). When you know you won't hang onto an
object until after the current iteration of the event loop, you can
safely ignore reference counting. Only when you store a reference to
an object somewhere (such as in an instance variable) do you have to
worry about the refcount.

<rant class="mini">
The annoying part of Cocoa's refcounting scheme is that, unlike
Python, it doesn't have a GC to clean up loops. This causes several
parts of the Cocoa framework to ignore refcounts to avoid creating
loops, which is rather annoying when you write a bridge to Cocoa and
want to hide reference counting details.
</rant>

Ronald

P.S. Apple is switching to a non-reference counting GC scheme in OSX  
10.5 (http://www.apple.com/macosx/leopard/xcode.html). 

From greg.ewing at canterbury.ac.nz  Tue Sep 19 07:49:52 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 19 Sep 2006 17:49:52 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <F53B4026-089C-4EFA-9C34-68A161D612DF@python.org>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<8D168D7C-8B62-4522-B63C-0F1023AE85C4@python.org>
	<fb6fbf560609181333p32171798q398a4f489f176e82@mail.gmail.com>
	<F53B4026-089C-4EFA-9C34-68A161D612DF@python.org>
Message-ID: <450F8500.5080102@canterbury.ac.nz>

Barry Warsaw wrote:

> It's been a long while since I programmed on the NeXT, so Mac folks  
> here please chime in, but isn't there some Foundation idiom where  
> temporary Objective-C objects didn't need to be explicitly released  
> if their lifetime was exactly the duration of the function in which  
> they were created?

I think you're talking about the autorelease mechanism.
It's a kind of delayed decref, the delay being until
execution reaches some safe place, usually the main event
loop of the application.

It exists because Cocoa mostly manages refcounts on a much
coarser-grained scale than Python. You don't normally
count all the temporary references created by parameters
and local variables, only "major" ones such as references
stored in an instance variable of an object. The problem
then is that an object might get released while in the
middle of executing one or more of its methods, and there
are still references to it in active stack frames. By
delaying the decref until returning to the main loop, all
these references have hopefully gone away by the time
the object gets freed.

You couldn't translate this scheme directly into Python,
because there are various differences in the way refcounts
are used. There's also not really any safe place to do
the delayed decrefs. The interpreter loop is *not* a safe
place, because there can be nested invocations of it,
with C stack frames outside the current one holding
references.

-- 
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,	   | Carpe post meridiem!          	  |
Christchurch, New Zealand	   | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz	   +--------------------------------------+

From mcherm at mcherm.com  Tue Sep 19 14:36:09 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Tue, 19 Sep 2006 05:36:09 -0700
Subject: [Python-3000] Removing __del__
Message-ID: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>

The following comments got me thinking:

Raymond:
> Statistics incontrovertibly prove that people who habitually
> avoid __del__ lead happier lives and spend fewer hours in therapy ;-)

Adam Olsen:
> I agree here.  I think an executor approach is much better; kill the
> object, then make a weakref callback do any further cleanups using
> copies it made in advance.

And of course similar sentiments have been expressed in many Python
discussions by many people over several years.

Since we're apparently still in "propose wild ideas" mode for Py3K
I'd like to propose that for Py3K we remove __del__. Not "fix" it,
not "tweak" it, just remove it and perhaps add a note in the manual
pointing people to the weakref module.

What'cha think folks? I'd love to hear an opinion from someone who
is a current user of __del__ -- I'm not.

-- Michael Chermside

From qrczak at knm.org.pl  Tue Sep 19 16:33:52 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 16:33:52 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
	(Michael Chermside's message of "Tue, 19 Sep 2006 05:36:09 -0700")
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
Message-ID: <87irjk7wof.fsf@qrnik.zagroda>

Michael Chermside <mcherm at mcherm.com> writes:

> Adam Olsen:
>> I agree here.  I think an executor approach is much better; kill the
>> object, then make a weakref callback do any further cleanups using
>> copies it made in advance.

I agree. Objects with finalizers with the semantics of __del__ are
inherently unsafe: they can be finalized when they are still in use,
namely when they are used by a finalizer of another object.

The correct way is to register a finalizer from outside the object,
such that it's invoked asynchronously when the associated object has
been garbage collected. Everything reachable from a finalizer is
considered live.

As far as I understand it, Python's weakrefs have mostly correct
semantics.

Finalizers must be invoked from a separate thread:
http://www.hpl.hp.com/techreports/2002/HPL-2002-335.html

The finalizer should not access the associated object itself (or it
will never be invoked), but it should only access the parts of the
object and other objects that it needs.

Sometimes it is necessary to split an object into an outer part which
triggers finalization, and an inner part which is accessed by the
finalizer. Even though this looks inconvenient, this design is
necessary for building rock solid finalizable objects.

This design allows for the presence of a finalizer to be a private
implementation detail. __del__ methods don't have this property
because objects with finalizers are unsafe to use from other
finalizers.

Python documentation contains the following snippet:

"Starting with version 1.5, Python guarantees that globals whose name
begins with a single underscore are deleted from their module before
other globals are deleted; if no other references to such globals
exist, this may help in assuring that imported modules are still
available at the time when the __del__() method is called."

This is clearly a hack which just increases the likelihood that the
code works. A correct design allows one to write code which works in
100% of the cases.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Tue Sep 19 16:42:18 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 16:42:18 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org> (Barry
	Warsaw's message of "Mon, 18 Sep 2006 13:40:19 -0400")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
Message-ID: <87eju79aut.fsf@qrnik.zagroda>

Barry Warsaw <barry at python.org> writes:

> What worries me is the unpredictability of gc vs. refcounting.  For
> some class of Python applications it's important that when an object
> is dereferenced it really goes away right then.  I /like/ reference
> counting!

This can be solved by explicit freeing of objects whose cleanup must
be performed deterministically.

Lisp has UNWIND-PROTECT and type-specific macros like WITH-OPEN-FILE.
C# has the 'using' keyword. Python has 'with', which can be used for that.

Reference counting is inefficient, doesn't by itself handle cycles,
and is impractical to combine with threads which run in parallel. The
general consensus of modern language implementations is that a tracing
GC is the future.

I admit that implementing a good GC is hard. It's quite hard to make
it incremental, and it's hard to avoid stopping all threads during GC
(but it's easier to allow threads to run in parallel between GCs, with
no need of forced synchronization each time a reference to an object
is created).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From barry at python.org  Tue Sep 19 16:53:23 2006
From: barry at python.org (Barry Warsaw)
Date: Tue, 19 Sep 2006 10:53:23 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87eju79aut.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
Message-ID: <CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>

On Sep 19, 2006, at 10:42 AM, Marcin 'Qrczak' Kowalczyk wrote:

> Barry Warsaw <barry at python.org> writes:
>
>> What worries me is the unpredictability of gc vs. refcounting.  For
>> some class of Python applications it's important that when an object
>> is dereferenced it really goes away right then.  I /like/ reference
>> counting!
>
> This can be solved by explicit freeing of objects whose cleanup must
> be performed deterministically.
>
> Lisp has UNWIND-PROTECT and type-specific macros like WITH-OPEN-FILE.
> C# has the 'using' keyword. Python has 'with', which can be used for that.

I don't see how that helps.  I can remove all references to the  
object but I still have to wait until gc runs to free it.  Can you  
explain your idea in more detail?

> Reference counting is inefficient, doesn't by itself handle cycles,
> and is impractical to combine with threads which run in parallel. The
> general consensus of modern language implementations is that a tracing
> GC is the future.
>
> I admit that implementing a good GC is hard. It's quite hard to make
> it incremental, and it's hard to avoid stopping all threads during GC
> (but it's easier to allow threads to run in parallel between GCs, with
> no need of forced synchronization each time a reference to an object
> is created).

I just think that it's important to remember that there are use cases  
that reference counting solves.  GC and refcounting both have their  
pros and cons.  I tend to think that Python's current refcounting +  
cyclic gc is the devil we know, so unless there is a clear, proven  
better way I'm not eager to change it.

-Barry

From brian at sweetapp.com  Tue Sep 19 17:29:00 2006
From: brian at sweetapp.com (Brian Quinlan)
Date: Tue, 19 Sep 2006 17:29:00 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87eju79aut.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>	<8764fl87j3.fsf@qrnik.zagroda>	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
Message-ID: <45100CBC.2060304@sweetapp.com>

Marcin 'Qrczak' Kowalczyk wrote:
> Reference counting is inefficient, doesn't by itself handle cycles,
> and is impractical to combine with threads which run in parallel. The
> general consensus of modern language implementations is that a tracing
> GC is the future.

How is reference counting inefficient?

Cheers,
Brian

From qrczak at knm.org.pl  Tue Sep 19 17:29:12 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 17:29:12 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org> (Barry
	Warsaw's message of "Tue, 19 Sep 2006 10:53:23 -0400")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
Message-ID: <878xkfc1tj.fsf@qrnik.zagroda>

Barry Warsaw <barry at python.org> writes:

> I don't see how that helps.  I can remove all references to the
> object but I still have to wait until gc runs to free it.  Can you
> explain your idea in more detail?

Objects which should be closed deterministically have the closing
action decoupled from the lifetime of the object. They are closed
explicitly; the object in a "closed" state doesn't take up any
sensitive resources.

> I just think that it's important to remember that there are use
> cases that reference counting solves. GC and refcounting both have
> their pros and cons.

Unfortunately it's hard to mix the two styles. Counting all reference
operations in the presence of a real GC would imply paying the costs
of both schemes together.

> I tend to think that Python's current refcounting + cyclic gc is the
> devil we know, so unless there is a clear, proven better way I'm not
> eager to change it.

They are different sets of tradeoffs; neither is universally better.
I claim that a tracing GC is usually better, or better overall,
but it can't be proven to be better in all respects.

Changing an existing system creates more compatibility obstacles than
designing a system from scratch. I'm not convinced that it's practical
to change the Python GC now. I only wish it had a tracing GC instead.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Tue Sep 19 17:50:29 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 17:50:29 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <45100CBC.2060304@sweetapp.com> (Brian Quinlan's message of
	"Tue, 19 Sep 2006 17:29:00 +0200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com>
Message-ID: <87mz8vg8je.fsf@qrnik.zagroda>

Brian Quinlan <brian at sweetapp.com> writes:

>> Reference counting is inefficient, doesn't by itself handle cycles,
>> and is impractical to combine with threads which run in parallel. The
>> general consensus of modern language implementations is that a tracing
>> GC is the future.
>
> How is reference counting inefficient?

It involves operations every time an object is merely passed around,
as references to the object are created or destroyed.

It doesn't move objects in memory, and thus free memory is fragmented.
Memory allocation can't just chop from a single area of free memory.
It can't allocate several objects with the cost of one allocation either.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From jimjjewett at gmail.com  Tue Sep 19 17:54:30 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 19 Sep 2006 11:54:30 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
Message-ID: <fb6fbf560609190854n25a3ce75u7458a7d64c830048@mail.gmail.com>

On 9/19/06, Michael Chermside <mcherm at mcherm.com> wrote:
> The following comments got me thinking:

> Raymond:
> > Statistics incontrovertibly prove that people who habitually
> > avoid __del__ lead happier lives and spend fewer hours in therapy ;-)

> Adam Olsen:
> > I agree here.  I think an executor approach is much better; kill the
> > object, then make a weakref callback do any further cleanups using
> > copies it made in advance.

> Since we're apparently still in "propose wild ideas" mode for Py3K
> I'd like to propose that for Py3K we remove __del__. Not "fix" it,
> not "tweak" it, just remove it and perhaps add a note in the manual
> pointing people to the weakref module.

The various "create a separate closer object instead" recipescall seem
to cause a jump in complexity, particularly if you try for a general
solution.

I do think we should split __del__ into the (rare, problematic)
general case and a "special-purpose" lightweight __close__ version
that does a better job in the normal case.

For the general case, Python refuses to guess which order to
call __del__ methods in within a cycle; this has the unfortunate side
effect of making such cycles immortal.

Almost all actual __del__ uses are effectively a call to self.close().
The call might be required (Tk would leak if tkinter didn't notify
it), or it might just be good housekeeping.  The key point is that
order doesn't matter.

In practice they all seem to already be written defensively, so that
they can be called multiple times, or even after teardown has started.

So the semantics of __close__ would be just like those of __del__ except that
(1)  It would be called at least once if the process terminates normally.
(2)  Call order for linked objects would be arbitrary.

FWIW, I couldn't find a single example in the stdlib (outside of
tests) that wouldn't work at least as well if converted to a __close__
method.  (subprocess and popen2 would be harder if __close__ were a
once-only method, like I think generator close ended up becoming.)

-jJ

From barry at python.org  Tue Sep 19 18:01:53 2006
From: barry at python.org (Barry Warsaw)
Date: Tue, 19 Sep 2006 12:01:53 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <878xkfc1tj.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda>
Message-ID: <2CEDFB05-01F4-47C6-A8B7-460A9FFFD369@python.org>

On Sep 19, 2006, at 11:29 AM, Marcin 'Qrczak' Kowalczyk wrote:

> Barry Warsaw <barry at python.org> writes:
>
>> I don't see how that helps.  I can remove all references to the
>> object but I still have to wait until gc runs to free it.  Can you
>> explain your idea in more detail?
>
> Objects which should be closed deterministically have the closing
> action decoupled from the lifetime of the object. They are closed
> explicitly; the object in a "closed" state doesn't take up any
> sensitive resources.

It's not external resources I'm concerned about, it's the physical  
memory consumed in process by objects which are unreachable but not  
reclaimed.

-Barry

From brian at sweetapp.com  Tue Sep 19 18:05:46 2006
From: brian at sweetapp.com (Brian Quinlan)
Date: Tue, 19 Sep 2006 18:05:46 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>	<8764fl87j3.fsf@qrnik.zagroda>	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>	<87eju79aut.fsf@qrnik.zagroda>
	<45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda>
Message-ID: <4510155A.5010308@sweetapp.com>

Marcin 'Qrczak' Kowalczyk wrote:
> Brian Quinlan <brian at sweetapp.com> writes:
> 
>>> Reference counting is inefficient, doesn't by itself handle cycles,
>>> and is impractical to combine with threads which run in parallel. The
>>> general consensus of modern language implementations is that a tracing
>>> GC is the future.
>> How is reference counting inefficient?

Do you somehow know that a tracing GC would be more efficient for
typical Python programs, or are you just speculating?

> It involves operations every time an object is merely passed around,
> as references to the object are created or destroyed.

But if the lifetime of most objects is confined to a single function 
call, isn't reference counting going to be quite efficient?

> It doesn't move objects in memory, and thus free memory is fragmented.

OK. Have you had memory fragmentation problems with Python?

> Memory allocation can't just chop from a single area of free memory.
> It can't allocate several objects with the cost of one allocation either.

I'm not sure what you mean here.

Cheers,
Brian

From barry at python.org  Tue Sep 19 18:10:35 2006
From: barry at python.org (Barry Warsaw)
Date: Tue, 19 Sep 2006 12:10:35 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <4510155A.5010308@sweetapp.com>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>	<8764fl87j3.fsf@qrnik.zagroda>	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>	<87eju79aut.fsf@qrnik.zagroda>
	<45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda>
	<4510155A.5010308@sweetapp.com>
Message-ID: <13D26562-CF05-44F3-A855-0CD280BA3140@python.org>

On Sep 19, 2006, at 12:05 PM, Brian Quinlan wrote:

> Marcin 'Qrczak' Kowalczyk wrote:
>> Brian Quinlan <brian at sweetapp.com> writes:
>>
>>>> Reference counting is inefficient, doesn't by itself handle cycles,
>>>> and is impractical to combine with threads which run in  
>>>> parallel. The
>>>> general consensus of modern language implementations is that a  
>>>> tracing
>>>> GC is the future.
>>> How is reference counting inefficient?
>
> Do somehow know that tracing GC would be more efficient for typical
> python programs or are you just speculating?

Also, what does "efficient" mean here?  Overall program run time?  No  
user-discernible pauses in operation?  Stinginess in overall memory use?

There are a lot of different efficiency parameters to consider, and  
of course different applications will care more about some than  
others.  A u/i-based tool doesn't want noticeable pauses.  A long  
running daemon wants manageable and predictable memory utilization.   
Etc.

-Barry

From jcarlson at uci.edu  Tue Sep 19 18:23:01 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Tue, 19 Sep 2006 09:23:01 -0700
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda>
References: <45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda>
Message-ID: <20060919091147.07F3.JCARLSON@uci.edu>


"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> wrote:
> 
> Brian Quinlan <brian at sweetapp.com> writes:
> 
> >> Reference counting is inefficient, doesn't by itself handle cycles,
> >> and is impractical to combine with threads which run in parallel. The
> >> general consensus of modern language implementations is that a tracing
> >> GC is the future.
> >
> > How is reference counting inefficient?
> 
> It involves operations every time an object is merely passed around,
> as references to the object are created or destroyed.

Redefine the INC/DECREF macros to assign something like 2**30 as the
reference count in INCREF, and make DECREF do nothing.  A write of a
constant should be measurably faster than an increment.  Run some
relatively small test program (be concerned about memory!), and compare
the results to see if there is a substantial difference in performance.
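
Concretely, the experiment might amount to editing the macros in
Include/object.h of a scratch build to something like this (untested,
and it leaks everything by design):

    /* every "incref" becomes a plain store of a huge constant, and
       "decref" disappears entirely, so nothing is ever freed */
    #define Py_INCREF(op)  ((op)->ob_refcnt = ((Py_ssize_t)1 << 30))
    #define Py_DECREF(op)  ((void)(op))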

> It doesn't move objects in memory, and thus free memory is fragmented.
> Memory allocation can't just chop from a single area of free memory.
> It can't allocate several objects with the cost of one allocation either.

It can certainly allocate several objects with the cost of one
allocation, but it can't *deallocate* those objects individually.  See
the various freelists for examples where this is used successfully in
Python now.
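
For anyone who hasn't read that code, the idiom is roughly this
(a simplified sketch, not CPython's actual code):

    #include <stdlib.h>

    typedef struct node {
        struct node *next;
        /* ... object payload ... */
    } node;

    #define BLOCK 256
    static node *free_list = NULL;

    static node *
    node_alloc(void)              /* one malloc serves BLOCK objects */
    {
        node *n;
        if (free_list == NULL) {
            int i;
            node *block = malloc(BLOCK * sizeof(node));
            if (block == NULL)
                return NULL;
            for (i = 0; i < BLOCK - 1; i++)
                block[i].next = &block[i + 1];
            block[BLOCK - 1].next = NULL;
            free_list = block;
        }
        n = free_list;
        free_list = n->next;
        return n;
    }

    static void
    node_free(node *n)            /* cheap, but the block itself is
                                     never returned to the system */
    {
        n->next = free_list;
        free_list = n;
    }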

 - Josiah


From qrczak at knm.org.pl  Tue Sep 19 18:37:55 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 18:37:55 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <2CEDFB05-01F4-47C6-A8B7-460A9FFFD369@python.org> (Barry
	Warsaw's message of "Tue, 19 Sep 2006 12:01:53 -0400")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda>
	<2CEDFB05-01F4-47C6-A8B7-460A9FFFD369@python.org>
Message-ID: <87r6y7ke1o.fsf@qrnik.zagroda>

Barry Warsaw <barry at python.org> writes:

> It's not external resources I'm concerned about, it's the physical
> memory consumed in process by objects which are unreachable but not
> reclaimed.

The rate of garbage collection depends on the rate of allocation.
While objects are not freed at the earliest possible moment,
they are freed when the memory is needed for other objects.


Brian Quinlan <brian at sweetapp.com> writes:

> Do somehow know that tracing GC would be more efficient for typical 
> python programs or are you just speculating?

I'm mostly speculating. It's hard to measure the difference between
garbage collection schemes because most language runtimes are tied
to a particular GC implementation, and thus you can't substitute a
different GC leaving everything else the same.

I've done some experiments with C++-based GCs incl. reference counting,
but they were inconclusive. The effects strongly depend on the kind of
the program and the amount of memory it uses, and various GC schemes
are better or worse in different scenarios.

>> It involves operations every time an object is merely passed around,
>> as references to the object are created or destroyed.
>
> But if the lifetime of most objects is confined to a single function 
> call, isn't reference counting going to be quite efficient?

Even if an object begins and ends its lifetime within a particular
function call, it's usually passed down to other functions in the
meantime.

Every time a Python function returns an object, the reference count
on the result is incremented, and it's decremented at some time by
its caller. Every time a function implemented in Python is called,
reference counts of its parameters are incremented, and they are
decremented when it returns. Every time a None is stored in a data
structure or returned from a function, its reference count is
incremented. Every time a list is freed, reference counts of objects
it refers to is decremented. Every time two ints are added, the
reference count of the result is incremented even if that integer
was preallocated. Every time a field is assigned to, two reference
counts are manipulated.
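
To spell out that last point in C (the struct here is invented, but
the incref-before-decref ordering is the standard recommended
pattern):

    typedef struct {
        PyObject_HEAD
        PyObject *attr;
    } Thing;

    static void
    thing_set_attr(Thing *self, PyObject *v)
    {
        PyObject *old = self->attr;
        Py_INCREF(v);       /* count #1: the struct takes a reference */
        self->attr = v;
        Py_XDECREF(old);    /* count #2: release the previous value */
    }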

>> It doesn't move objects in memory, and thus free memory is fragmented.
>
> OK. Have you had memory fragmentation problems with Python?

Indirectly: memory allocation can't be as fast as in some GC schemes.

>> Memory allocation can't just chop from a single area of free memory.
>> It can't allocate several objects with the cost of one allocation either.
>
> I'm not sure what you mean here.

There are various GCs (including OCaml, and probably Sun's Java and
Microsoft's .NET implementations, and my language implementation,
and surely others) where the fast path of memory allocation looks
like stack allocation with overflow checking.
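
In code, that fast path is roughly the following (a generic sketch,
not any particular collector):

    #include <stddef.h>

    static char *alloc_ptr;     /* next free byte in the nursery */
    static char *alloc_limit;   /* end of the nursery */

    static void *
    gc_alloc(size_t size)
    {
        void *result;
        if (alloc_ptr + size > alloc_limit)
            minor_collection();  /* hypothetical: evacuates survivors
                                    and resets alloc_ptr */
        result = alloc_ptr;
        alloc_ptr += size;
        return result;
    }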

Moreover, if several objects are to be allocated at once (which I
admit is more likely in compiled code), the cost is still the same
as allocation of one object of the size being the sum of sizes of the
objects (not counting filling the objects with contents). There are no
per-allocated-object data to fill besides a header which points to a
static structure which describes the layout.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Tue Sep 19 18:55:26 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 18:55:26 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda> (Marcin Kowalczyk's message of
	"Tue, 19 Sep 2006 17:50:29 +0200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com>
	<87mz8vg8je.fsf@qrnik.zagroda>
Message-ID: <87wt7zzthd.fsf@qrnik.zagroda>

"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

> It involves operations every time an object is merely passed around,
> as references to the object are created or destroyed.

And it does something when it frees an object. In some GCs there is
a cost associated with keeping an object alive, but there is no
per-object cost when a group of objects die.

Most objects die young. This is what I've measured myself. When my
compiler runs, the average lifetime of an object is about 1/5 of a GC
cycle.
This means that 80% of objects have only an allocation cost, while
freeing is free. And with a generational GC most of others are copied
only once: major GCs are less frequent than minor GCs.

It is true that a given long-living object has a larger cost, but such
objects are a minority, and I believe this scheme pays off. Especially
if it was implemented better than I did it; this is the only GC I've
implemented so far, I'm sure that experienced people can tune it better.
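
(Incidentally, CPython's cycle detector is generational too, which you
can observe from the gc module; collections of the youngest generation
are far more frequent than full ones:)

    import gc

    print(gc.get_threshold())   # e.g. (700, 10, 10): gen0 is examined
                                # after ~700 net allocations, gen1 after
                                # 10 gen0 passes, gen2 after 10 gen1 passes
    print(gc.get_count())       # current per-generation counts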

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Tue Sep 19 20:48:11 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 19 Sep 2006 20:48:11 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <fb6fbf560609190854n25a3ce75u7458a7d64c830048@mail.gmail.com> (Jim
	Jewett's message of "Tue, 19 Sep 2006 11:54:30 -0400")
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
	<fb6fbf560609190854n25a3ce75u7458a7d64c830048@mail.gmail.com>
Message-ID: <87lkofn15g.fsf@qrnik.zagroda>

"Jim Jewett" <jimjjewett at gmail.com> writes:

> I do think we should split __del__ into the (rare, problematic)
> general case and a "special-purpose" lightweight __close__ version
> that does a better job in the normal case.

A synchronous finalizer which doesn't keep the objects it refers to alive,
like Python's __del__, is sufficient when the finalizer doesn't use
other finalizable objects, and doesn't conflict with the rest of the
program in terms of potentially concurrent operations on shared data
(read/write or write/write).

Note that the concurrency conflict can manifest even in a
single-threaded program, because __del__ finalizers are in fact
semi-asynchronous: they are invoked when a reference count is
decremented and causes the relevant object to become dead, which
can happen in lots of places, even on a seemingly innocent variable
assignment.
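
A tiny illustration of that semi-asynchrony (CPython behaviour):

    class Noisy:
        def __del__(self):
            print('finalized')

    x = Noisy()
    x = None    # an innocent-looking rebinding: the count drops to
                # zero and __del__ runs right here, in mid-assignment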

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From mcherm at mcherm.com  Tue Sep 19 21:54:15 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Tue, 19 Sep 2006 12:54:15 -0700
Subject: [Python-3000] Delayed reference counting idea
Message-ID: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com>

Speaking on the speed of GC implementations, Marcin writes:
> I'm mostly speculating. It's hard to measure the difference between
> garbage collection schemes because most language runtimes are tied
> to a particular GC implementation, and thus you can't substitute a
> different GC leaving everything else the same.

Interestingly, one of the original goals of PyPy was to create a
test bed in which it was easy to experiment and answer just this
kind of question. Unfortunately, although they have an architecture
allowing pluggable GC algorithms (what an incredible concept!), I
don't believe that any reliable conclusions can be drawn from things
as they now stand.

For more details see
http://codespeak.net/pypy/dist/pypy/doc/garbage_collection.html

-- Michael Chermside

From martin at v.loewis.de  Tue Sep 19 23:32:07 2006
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Tue, 19 Sep 2006 23:32:07 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	<450E34EF.3090202@solarsail.hcs.harvard.edu>
	<1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>
Message-ID: <451061D7.4010105@v.loewis.de>

Paul Prescod schrieb:
> Even if I won't contribute code or even a design to the solution
> (because it isn't an area of expertise and I'm still working on
> encodings stuff) I think that there would be value in saying: "There's a
> big problem here and we intend to fix it in Python 3000."

This is of value only if "we" really intend to fix it. I don't,
and apparently, you don't either. It would be very bad to claim
that "we" will fix it, and then not do it. It's much much much much better
to acknowledge that "we" aren't going to fix it, not with Python
3.0, and likely not with any release in the foreseeable future.
The only exception would be if somebody offered a reasonable
solution, which "we" would just have to incorporate (and possibly
maintain, although it would be good if the original author would
be around for a year or so).

Regards,
Martin


From martin at v.loewis.de  Tue Sep 19 23:34:53 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 19 Sep 2006 23:34:53 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <450E792C.1070105@gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>		<450D4AAE.2000805@gmail.com>	<bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>
	<450E792C.1070105@gmail.com>
Message-ID: <4510627D.2080704@v.loewis.de>

Nick Coghlan schrieb:
> I was thinking it would be easier to split out the Global Interpreter Lock and 
> a per-interpreter Local Interpreter Lock, rather than trying to go to a full 
> free-threading model. Anyone sharing other objects between interpreters would 
> still need their own synchronisation mechanism, but something like 
> threading.Queue should suffice for that.

The challenge with that is "global" (i.e. across-interpreter) objects.
There are several of these: the obvious singletons (None, True, False),
the non-obvious singletons ((), -2..100 or so), and the extension module
globals (types, and in particular exceptions).

Do you want them still to be global, or per-interpreter?
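
To make the non-obvious ones concrete (a CPython implementation detail;
the exact cached range varies by version):

    >>> a = 100
    >>> b = 100
    >>> a is b            # small ints are shared interpreter-wide
    True
    >>> a = 100000
    >>> b = 100000
    >>> a is b            # larger ints generally are not
    False
    >>> () is ()          # the empty tuple is shared too
    True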

Regards,
Martin

From martin at v.loewis.de  Tue Sep 19 23:41:05 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 19 Sep 2006 23:41:05 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>	<8764fl87j3.fsf@qrnik.zagroda>	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>	<87eju79aut.fsf@qrnik.zagroda>
	<45100CBC.2060304@sweetapp.com> <87mz8vg8je.fsf@qrnik.zagroda>
Message-ID: <451063F1.9050207@v.loewis.de>

Marcin 'Qrczak' Kowalczyk schrieb:
> It doesn't move objects in memory, and thus free memory is fragmented.

That's true, but not a problem.

> Memory allocation can't just chop from a single area of free memory.

That's not true. Python does it all the time. Allocation is in constant
time most of the time (in some applications, it's always constant).

Regards,
Martin

From rasky at develer.com  Tue Sep 19 23:42:43 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 19 Sep 2006 23:42:43 +0200
Subject: [Python-3000] Removing __del__
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
Message-ID: <023701c6dc34$8a79dc50$a14c2597@bagio>

Michael Chermside <mcherm at mcherm.com> wrote:

> Since we're apparently still in "propose wild ideas" mode for Py3K
> I'd like to propose that for Py3K we remove __del__. Not "fix" it,
> not "tweak" it, just remove it and perhaps add a note in the manual
> pointing people to the weakref module.


I don't use __del__ much. I use it only in leaf classes, where it surely can't
be part of loops. In those rare cases, it's very useful to me. For instance, I
have a small class which wraps an existing handle-based C API exported to
Python. Something along the lines of:

class Wrapper:
    def __init__(self, *args):
           self.handle = CAPI.init(*args)

    def __del__(self, *args):
            CAPI.close(self.handle)

    def foo(self):
            CAPI.foo(self.handle)

The real class isn't much longer than this (really). How do you propose to
write this same code without __del__?

Notice that I'd be perfectly fine with the __close__ semantics proposed in this
thread (it might be called more than once, and order within the loop doesn't matter).

Giovanni Bajo


From rasky at develer.com  Tue Sep 19 23:50:34 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 19 Sep 2006 23:50:34 +0200
Subject: [Python-3000] Delayed reference counting idea
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF4695@hammer.office.bhtrader.com>
	<450F7E00.7000304@canterbury.ac.nz>
Message-ID: <026e01c6dc35$a4ee83f0$a14c2597@bagio>

Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:

>> * An easier C API would significantly benefit the language in terms
>> of more extensions being available and in terms of increased
>> reliability for those extensions.  The current refcount scheme
>> results in pervasive refleak bugs and subsequent, interminable
>> bughunts.
>
> It's not clear that a different scheme would be much
> different, though. If it's not refcounting, there will
> be some other set of rules that must be followed, with
> equally obscure bugs if you slip up.

Agreed.

> Also, at least half of the boilerplate is due to the
> necessity of checking for errors at each step. A
> different GC scheme wouldn't help with that.

Given that C handles errors (manual checks needed at every step) and
finalizers (manual refcounting and de-refcounting) in an equally bad fashion,
maybe a C++ Python is in order? ATL helped somewhat with COM refcounting,
after all.

Giovanni Bajo


From rhamph at gmail.com  Wed Sep 20 00:31:36 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Tue, 19 Sep 2006 16:31:36 -0600
Subject: [Python-3000] Removing __del__
In-Reply-To: <023701c6dc34$8a79dc50$a14c2597@bagio>
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
	<023701c6dc34$8a79dc50$a14c2597@bagio>
Message-ID: <aac2c7cb0609191531r425c6626n2630c4c611e190b8@mail.gmail.com>

On 9/19/06, Giovanni Bajo <rasky at develer.com> wrote:
> Michael Chermside <mcherm at mcherm.com> wrote:
>
> > Since we're apparently still in "propose wild ideas" mode for Py3K
> > I'd like to propose that for Py3K we remove __del__. Not "fix" it,
> > not "tweak" it, just remove it and perhaps add a note in the manual
> > pointing people to the weakref module.
>
>
> I don't use __del__ much. I use it only in leaf classes, where it surely can't
> be part of loops. In those rare cases, it's very useful to me. For instance, I
> have a small classes which wraps an existing handle-based C API exported to
> Python. Something along the lines of:
>
> class Wrapper:
>     def __init__(self, *args):
>            self.handle = CAPI.init(*args)
>
>     def __del__(self, *args):
>             CAPI.close(self.handle)
>
>     def foo(self):
>             CAPI.foo(self.handle)
>
> The real class isn't much longer than this (really). How do you propose to
> write this same code without __del__?

I've experimented with using metaclasses to do some fun things here.  It
could look something like this:

class Wrapper(Core):
    def __init__(self, *args):
        Core.__init__(self)
        self.core.handle = CAPI.init(*args)

    @coremethod
    def __coredel__(core):
        CAPI.close(core.handle)

    def foo(self):
        CAPI.foo(self.core.handle)

Works just fine in 2.x.
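
(Core and coremethod aren't stdlib, of course.  A rough sketch of one
way to write them -- the essential trick is that the weakref callback
closes over the separate "core" attribute bag, never over self:)

    import weakref

    _refs = set()    # keeps the weakrefs themselves alive

    def coremethod(func):
        # expose the finalizer as a plain function on the class
        return staticmethod(func)

    class Core(object):
        def __init__(self):
            self.core = type('CoreState', (), {})()   # attribute bag
            fin = getattr(type(self), '__coredel__', None)
            if fin is not None:
                core = self.core
                def callback(ref):
                    _refs.discard(ref)
                    fin(core)        # self is already gone by now
                _refs.add(weakref.ref(self, callback))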

-- 
Adam Olsen, aka Rhamphoryncus

From exarkun at divmod.com  Wed Sep 20 00:40:48 2006
From: exarkun at divmod.com (Jean-Paul Calderone)
Date: Tue, 19 Sep 2006 18:40:48 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <023701c6dc34$8a79dc50$a14c2597@bagio>
Message-ID: <20060919224048.1717.886353737.divmod.quotient.54336@ohm>

On Tue, 19 Sep 2006 23:42:43 +0200, Giovanni Bajo <rasky at develer.com> wrote:
>Michael Chermside <mcherm at mcherm.com> wrote:
>
>> Since we're apparently still in "propose wild ideas" mode for Py3K
>> I'd like to propose that for Py3K we remove __del__. Not "fix" it,
>> not "tweak" it, just remove it and perhaps add a note in the manual
>> pointing people to the weakref module.
>
>
>I don't use __del__ much. I use it only in leaf classes, where it surely can't
>be part of loops. In those rare cases, it's very useful to me. For instance, I
>have a small classes which wraps an existing handle-based C API exported to
>Python. Something along the lines of:
>
>class Wrapper:
>    def __init__(self, *args):
>           self.handle = CAPI.init(*args)
>
>    def __del__(self, *args):
>            CAPI.close(self.handle)
>
>    def foo(self):
>            CAPI.foo(self.handle)
>
>The real class isn't much longer than this (really). How do you propose to
>write this same code without __del__?

Untested, but roughly:

    import weakref

    _weakrefs = []

    def _cleanup(ref, handle):
        _weakrefs.remove(ref)
        CAPI.close(handle)

    class BetterWrapper:
        def __init__(self, *args):
            handle = self.handle = CAPI.init(*args)
            # the callback closes over handle, not self, so it does
            # not keep the instance alive
            _weakrefs.append(
                weakref.ref(self,
                    lambda ref: _cleanup(ref, handle)))

        def foo(self):
            CAPI.foo(self.handle)

There are probably even better ways too, this is just the first that comes
to mind.

Jean-Paul

From greg.ewing at canterbury.ac.nz  Wed Sep 20 02:22:47 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 20 Sep 2006 12:22:47 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <878xkfc1tj.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda>
Message-ID: <451089D7.6060204@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> Objects which should be closed deterministically have the closing
> action decoupled from the lifetime of the object.

That doesn't cover the case where the "closing" action
you want includes freeing the memory occupied by the
object. The game I mentioned earlier is one of those --
I don't need anything "closed", I just want the memory
back.

> They are closed
> explicitly; the object in a "closed" state doesn't take up any
> sensitive resources.
> 
> 
>>I just think that it's important to remember that there are use
>>cases that reference counting solves. GC and refcounting both have
>>their pros and cons.
> 
> 
> Unfortunately it's hard to mix the two styles. Counting all reference
> operations in the presence of a real GC would imply paying the costs
> of both schemes together.
> 
> 
>>I tend to think that Python's current refcounting + cyclic gc is the
>>devil we know, so unless there is a clear, proven better way I'm not
>>eager to change it.
> 
> 
> They are different sets of tradeoffs; neither is universally better.
> I claim that a tracing GC is usually better, or better in overall,
> but it can't be proven to be better in all respects.
> 
> Changing an existing system creates more compatibility obstacles than
> designing a system from scratch. I'm not convinced that it's practical
> to change the Python GC now. I only wish it had a tracing GC instead.
> 


From greg.ewing at canterbury.ac.nz  Wed Sep 20 02:25:07 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 20 Sep 2006 12:25:07 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87mz8vg8je.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com>
	<87mz8vg8je.fsf@qrnik.zagroda>
Message-ID: <45108A63.9090506@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> It doesn't move objects in memory, and thus free memory is fragmented.

That's not a feature of refcounting as such. With
sufficient indirection, moveable refcounted memory
blocks are possible (early Smalltalks worked that
way, I believe).

--
Greg

From greg.ewing at canterbury.ac.nz  Wed Sep 20 02:34:06 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 20 Sep 2006 12:34:06 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com>
References: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com>
Message-ID: <45108C7E.1080104@canterbury.ac.nz>

Michael Chermside wrote:

> Interestingly, one of the original goals of PyPy was to create a
> test bed in which it was easy to experiment and answer just this
> kind of question.

A worry about that is whether the architecture required to
allow pluggable GC implementations introduces inefficiencies
of its own that would skew the results.

--
Greg

From bob at redivi.com  Wed Sep 20 02:47:05 2006
From: bob at redivi.com (Bob Ippolito)
Date: Tue, 19 Sep 2006 17:47:05 -0700
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <45108C7E.1080104@canterbury.ac.nz>
References: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com>
	<45108C7E.1080104@canterbury.ac.nz>
Message-ID: <6a36e7290609191747r215f7184uc2cf2bd5679b821e@mail.gmail.com>

On 9/19/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Michael Chermside wrote:
>
> > Interestingly, one of the original goals of PyPy was to create a
> > test bed in which it was easy to experiment and answer just this
> > kind of question.
>
> A worry about that is whether the architecture required to
> allow pluggable GC implementations introduces inefficiencies
> of its own that would skew the results.
>

There's no need to worry about that in the case of PyPy. Those kinds
of choices are made way before runtime, so there's no required
indirection.

-bob

From qrczak at knm.org.pl  Wed Sep 20 03:01:33 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 20 Sep 2006 03:01:33 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <45108A63.9090506@canterbury.ac.nz> (Greg Ewing's message of
	"Wed, 20 Sep 2006 12:25:07 +1200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda> <45100CBC.2060304@sweetapp.com>
	<87mz8vg8je.fsf@qrnik.zagroda> <45108A63.9090506@canterbury.ac.nz>
Message-ID: <8764fjnyfm.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

>> It doesn't move objects in memory, and thus free memory is fragmented.
>
> That's not a feature of refcounting as such. With sufficient
> indirection, moveable refcounted memory blocks are possible
> (early Smalltalks worked that way, I believe).

Yes, but the indirection is a cost in itself. A tracing GC can move
objects without such indirection, because it can update all pointers
to the given object.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From greg.ewing at canterbury.ac.nz  Wed Sep 20 02:59:14 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 20 Sep 2006 12:59:14 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <6a36e7290609191747r215f7184uc2cf2bd5679b821e@mail.gmail.com>
References: <20060919125415.u61ujgr44siscwk0@login.werra.lunarpages.com>
	<45108C7E.1080104@canterbury.ac.nz>
	<6a36e7290609191747r215f7184uc2cf2bd5679b821e@mail.gmail.com>
Message-ID: <45109262.5020308@canterbury.ac.nz>

Bob Ippolito wrote:

> There's no need to worry about that in the case of PyPy. Those kinds
> of choices are made way before runtime, so there's no required
> indirection.

Even so, we're talking about machine-generated code rather
than the sort of hand-crafting you need to get the best
out of something critical like GC. There could still be
room for inefficiencies.

--
Greg

From ironfroggy at gmail.com  Wed Sep 20 04:54:07 2006
From: ironfroggy at gmail.com (Calvin Spealman)
Date: Tue, 19 Sep 2006 22:54:07 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <4510627D.2080704@v.loewis.de>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>
	<450D4AAE.2000805@gmail.com>
	<bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>
	<450E792C.1070105@gmail.com> <4510627D.2080704@v.loewis.de>
Message-ID: <76fd5acf0609191954p3979ba48v89e520fdf8e3d124@mail.gmail.com>

On 9/19/06, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Nick Coghlan schrieb:
> > I was thinking it would be easier to split out the Global Interpreter Lock and
> > a per-interpreter Local Interpreter Lock, rather than trying to go to a full
> > free-threading model. Anyone sharing other objects between interpreters would
> > still need their own synchronisation mechanism, but something like
> > threading.Queue should suffice for that.
>
> The challenge with that is "global" (i.e. across-interpreter) objects.
> There are several of these: the obvious singletons (None, True, False),
> the non-obvious singletons ((), -2..100 or so), and the extension module
> globals (types, and in particular exceptions).
>
> Do you want them still to be global, or per-interpreter?
>
> Regards,
> Martin

It is one fixable problem among many, but fixable nonetheless. Any
solution is going to break the API, but that should be allowed,
especially for something as important as this. The obvious and
non-obvious singletons don't represent much of a real problem once
you realize that you'll have to change the locking API anyway, at
least to specify the interpreter whose Local Interpreter Lock is being
operated on. Should you check every object to see what it is? No -- so
either don't have cross-interpreter globals (which doesn't save you
much anyway), or add a lock pointer to all PyObject structs, which can
point to a single GIL, a LIL, or something else down the road. The API
will need an extra parameter for locking anyway, so the door is already
open and the singletons aren't getting in the way.

From martin at v.loewis.de  Wed Sep 20 08:02:30 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Wed, 20 Sep 2006 08:02:30 +0200
Subject: [Python-3000] Kill GIL?
In-Reply-To: <76fd5acf0609191954p3979ba48v89e520fdf8e3d124@mail.gmail.com>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	
	<450D4AAE.2000805@gmail.com>	
	<bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>	
	<450E792C.1070105@gmail.com> <4510627D.2080704@v.loewis.de>
	<76fd5acf0609191954p3979ba48v89e520fdf8e3d124@mail.gmail.com>
Message-ID: <4510D976.1060600@v.loewis.de>

Calvin Spealman schrieb:
>> The challenge with that is "global" (i.e. across-interpreter) objects.
>> There are several of these: the obvious singletons (None, True, False),
>> the non-obvious singletons ((), -2..100 or so), and the extension module
>> globals (types, and in particular exceptions).
>>
>> Do you want them still to be global, or per-interpreter?
>>
>> Regards,
>> Martin
> 
> It is one fixable problem among many, but fixable none-the-less.
[...]

Your message didn't really answer the question, did it?

Regards,
Martin

From qrczak at knm.org.pl  Wed Sep 20 08:25:52 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 20 Sep 2006 08:25:52 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <451089D7.6060204@canterbury.ac.nz> (Greg Ewing's message of
	"Wed, 20 Sep 2006 12:22:47 +1200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
Message-ID: <87hcz38367.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

> That doesn't cover the case where the "closing" action
> you want includes freeing the memory occupied by the
> object. The game I mentioned earlier is one of those --
> I don't need anything "closed", I just want the memory

Why do you want to free memory at a particular point of time?

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From greg.ewing at canterbury.ac.nz  Wed Sep 20 10:53:20 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 20 Sep 2006 20:53:20 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87hcz38367.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda>
Message-ID: <45110180.4070807@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> Why do you want to free memory at a particular point of time?

I don't. However, I *do* want it freed by the time I
need it again, and I *don't* want unpredictable pauses
to catch up on backed-up memory-freeing, so that my
animations run smoothly.

--
Greg

From ncoghlan at gmail.com  Wed Sep 20 12:12:01 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 20 Sep 2006 20:12:01 +1000
Subject: [Python-3000] Kill GIL?
In-Reply-To: <4510627D.2080704@v.loewis.de>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>		<450D4AAE.2000805@gmail.com>	<bbaeab100609171103w88e64a6k1a0bad9887b282a4@mail.gmail.com>
	<450E792C.1070105@gmail.com> <4510627D.2080704@v.loewis.de>
Message-ID: <451113F1.2030302@gmail.com>

Martin v. Löwis wrote:
> Nick Coghlan schrieb:
>> I was thinking it would be easier to split out the Global Interpreter Lock and 
>> a per-interpreter Local Interpreter Lock, rather than trying to go to a full 
>> free-threading model. Anyone sharing other objects between interpreters would 
>> still need their own synchronisation mechanism, but something like 
>> threading.Queue should suffice for that.
> 
> The challenge with that is "global" (i.e. across-interpreter) objects.
> There are several of these: the obvious singletons (None, True, False),
> the non-obvious singletons ((), -2..100 or so), and the extension module
> globals (types, and in particular exceptions).
> 
> Do you want them still to be global, or per-interpreter?

The GIL would still exist - the idea would be that most threads would be 
spending most of their time holding only their local interpreter lock.

Only when reading or writing the state shared between interpreters would a 
thread need to acquire the GIL. Alternatively, the GIL might be turned into 
a read/write lock instead of a basic mutex, with threads normally holding a 
read lock which they periodically release & reacquire (in case any other 
threads are waiting to acquire it).

The latter approach would probably give better performance (since you wouldn't 
need to be dropping and reacquiring the GIL in order to access the singleton 
objects).
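
A minimal pure-Python sketch of the read/write idea (illustrative only;
a real version would live in C, and this one lets writers starve):

    import threading

    class ReadWriteLock:
        def __init__(self):
            self._cond = threading.Condition()
            self._readers = 0

        def acquire_read(self):
            with self._cond:
                self._readers += 1

        def release_read(self):
            with self._cond:
                self._readers -= 1
                if not self._readers:
                    self._cond.notify_all()

        def acquire_write(self):
            self._cond.acquire()         # new readers now block
            while self._readers:
                self._cond.wait()        # (briefly releases the lock)

        def release_write(self):
            self._cond.release()

    # A thread's inner loop would periodically do
    #     rw.release_read(); rw.acquire_read()
    # to give any waiting writer a chance to get in.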

Cheers,
Nick.

P.S. Just to be clear, I don't think doing this would be *easy*, but unlike 
full free-threading, I think it is at least potentially workable.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From qrczak at knm.org.pl  Wed Sep 20 12:57:59 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Wed, 20 Sep 2006 12:57:59 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <45110180.4070807@canterbury.ac.nz> (Greg Ewing's message of
	"Wed, 20 Sep 2006 20:53:20 +1200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz>
Message-ID: <87psdqztxk.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

>> Why do you want to free memory at a particular point of time?
>
> I don't. However, I *do* want it freed by the time I need it again,

As I said, the rate of GC depends on the rate of allocation.
Unreachable objects are collected when memory is needed for
allocation.

> and I *don't* want unpredictable pauses to catch up on backed-up
> memory-freeing,

Incremental GC (e.g. in OCaml) has short pauses. It doesn't scan all
memory at once, but distributes the work among GC cycles.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From ncoghlan at gmail.com  Wed Sep 20 13:18:58 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 20 Sep 2006 21:18:58 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
Message-ID: <451123A2.7040701@gmail.com>

Michael Chermside wrote:
> The following comments got me thinking:
> 
> Raymond:
>> Statistics incontrovertibly prove that people who habitually
>> avoid __del__ lead happier lives and spend fewer hours in therapy ;-)
> 
> Adam Olsen:
>> I agree here.  I think an executor approach is much better; kill the
>> object, then make a weakref callback do any further cleanups using
>> copies it made in advance.
 >
> What'cha think folks? I'd love to hear an opinion from someone who
> is a current user of __del__ -- I'm not.

How about an API change and a tweak to type.__call__, rather than complete 
removal?

I've re-used __del__ as the method name below, but a different name would 
obviously work too.

1. __del__ would become an automatic static method (like __new__)

2. Make an addition to the end of type.__call__ along the lines of (stealing 
from Jean-Paul's example):

     # sys.finalizers would just be a new global set in the sys module
     # that keeps the weakrefs alive until they are needed

     # In definition of type.__call__, after invoking __init__
     if hasattr(cls, '__del__'):
         finalizer = cls.__del__
         if hasattr(self, '__del_arg__'):
             finalizer_arg = self.__del_arg__
         else:
             # Create a class with the same instance attributes
             # as the original
             class attr_holder(object):
                 pass
             finalizer_arg = attr_holder()
             finalizer_arg.__dict__ = self.__dict__
         def call_finalizer(ref):
             sys.finalizers.remove(ref)
             finalizer(finalizer_arg)
         sys.finalizers.add(weakref.ref(self, call_finalizer))

3. The __init__ method then simply needs to make sure that the right argument 
is passed to __del__. For example, if the object holds a reference to a file 
that needs to be closed when the object goes away:

   class CloseFileOnDel(object):
       def __init__(self, fname):
           self.f = self.__del_arg__ = open(fname)
       def __del__(f):
           f.close()

Alternatively, the class could rely on the pseudo-self that is passed if 
__del_arg__ isn't defined:

   class CloseFileOnDel(object):
       def __init__(self, fname):
           self.f = open(fname)
       def __del__(self_attrs):
           self_attrs.f.close()

The only way for __del__ to receive a reference to self is if the finalizer 
argument had a reference to it - but that would mean the object itself was not 
collectable, so __del__ wouldn't be called in the first place.

That all seems too simple, though. Since we're talking about gc and that's 
never simple, there has to be something wrong with the idea :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From rhamph at gmail.com  Wed Sep 20 14:55:56 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 06:55:56 -0600
Subject: [Python-3000] How will unicode get used?
Message-ID: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>

Before we can decide on the internal representation of our unicode
objects, we need to decide on their external interface.  My thoughts
so far:

* Most transformation and testing methods (.lower(), .islower(), etc)
can be copied directly from 2.x.  They require no special
implementation to perform reasonably.
* Indexing and slicing is the big issue.  Do we need constant-time
integer slicing?  .find() could be changed to return a token that
could be used as a constant-time offset.  Incrementing the token would
have linear costs, but that's no big deal if the offsets are always
small.
* Grapheme clusters, words, lines, other groupings, do we need/want
ways to slice based on them too?
* Cheap slicing and concatenation (between O(1) and O(log(n))), do we
want to support them?  Now would be the time.

-- 
Adam Olsen, aka Rhamphoryncus

From krstic at solarsail.hcs.harvard.edu  Wed Sep 20 11:18:22 2006
From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=)
Date: Wed, 20 Sep 2006 05:18:22 -0400
Subject: [Python-3000] Kill GIL?
In-Reply-To: <451061D7.4010105@v.loewis.de>
References: <7008329d0609170528m5ca38011s292ce21396fa879@mail.gmail.com>	<450E34EF.3090202@solarsail.hcs.harvard.edu>
	<1cb725390609180119i548a5a65j841599633cba712f@mail.gmail.com>
	<451061D7.4010105@v.loewis.de>
Message-ID: <4511075E.6010101@solarsail.hcs.harvard.edu>

Martin v. Löwis wrote:
> The only exception would be if somebody offered a reasonable
> solution, which "we" would just have to incorporate (and possibly
> maintain, although it would be good if the original author would
> be around for a year or so).

I am interested in doing just this. I'm loath to spend time on it,
however, if it turns out that Guido still doesn't think multiprocessing
is a problem, or has a particular solution in mind. So once that clears
up, I'm happy to commit to a PEP, a reference implementation (if it can
be done purely in Python; if it involves diving into CPython, I'll
require assistance), and ongoing maintenance of the same for the
foreseeable future.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D


From mcherm at mcherm.com  Wed Sep 20 15:24:01 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 06:24:01 -0700
Subject: [Python-3000] Removing __del__
Message-ID: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>

Nick Coghlan writes:
    [...proposes revision of __del__ rather than removal...]
> The only way for __del__ to receive a reference to self is if the  
> finalizer argument had a reference to it - but that would mean the  
> object itself was not
> collectable, so __del__ wouldn't be called in the first place.
>
> That all seems too simple, though. Since we're talking about gc and  
> that's never simple, there has to be something wrong with the idea :)

Unfortunately you're right... this is all too simple. The existing
mechanism doesn't have a problem with __del__ methods that do not
participate in loops. For those that DO participate in loops I
think it's perfectly plausible for your __del__ to receive a reference
to the actual object being finalized.

Another problem (but less important as it's trivially fixable) is that
you're storing away the values that the object had when it was created,
perhaps missing out on things that got added or initialized later.

-- Michael Chermside


From krstic at solarsail.hcs.harvard.edu  Wed Sep 20 15:32:47 2006
From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=)
Date: Wed, 20 Sep 2006 21:32:47 +0800
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
References: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
Message-ID: <451142FF.20203@solarsail.hcs.harvard.edu>

Jim Jewett wrote:
>> > Ivan: why don't you write a PEP about this?
> 
>> I'd like to hear Guido's overarching thoughts on the matter, if any, and
>> would afterwards be happy to write a PEP.

The `this` and `the matter` referred not to removing the GIL, but
providing some form of sane multiprocessing support that doesn't require
everyone interested in MP to reinvent the wheel. The GIL situation and
Guido's position on it seem pretty clear to me, as I've tried to
indicate in prior messages.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From rasky at develer.com  Wed Sep 20 15:36:48 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Wed, 20 Sep 2006 15:36:48 +0200
Subject: [Python-3000] Removing __del__
References: <023701c6dc34$8a79dc50$a14c2597@bagio>
	<20060919224048.1717.886353737.divmod.quotient.54336@ohm>
Message-ID: <016801c6dcb9$d2915c40$e303030a@trilan>

Jean-Paul Calderone wrote:

>>> Since we're apparently still in "propose wild ideas" mode for Py3K
>>> I'd like to propose that for Py3K we remove __del__. Not "fix" it,
>>> not "tweak" it, just remove it and perhaps add a note in the manual
>>> pointing people to the weakref module.
>>
>>
>> I don't use __del__ much. I use it only in leaf classes, where it
>> surely can't be part of loops. In those rare cases, it's very useful
>> to me. For instance, I have a small classes which wraps an existing
>> handle-based C API exported to Python. Something along the lines of:
>>
>> class Wrapper:
>>    def __init__(self, *args):
>>           self.handle = CAPI.init(*args)
>>
>>    def __del__(self, *args):
>>            CAPI.close(self.handle)
>>
>>    def foo(self):
>>            CAPI.foo(self.handle)
>>
>> The real class isn't much longer than this (really). How do you
>> propose to write this same code without __del__?
>
> Untested, but roughly:
>
>     _weakrefs = []
>
>     def _cleanup(ref, handle):
>         _weakrefs.remove(ref)
>         CAPI.close(handle)
>
>     class BetterWrapper:
>         def __init__(self, *args):
>             handle = self.handle = CAPI.init(*args)
>             _weakrefs.append(
>                 weakref.ref(self,
>                     lambda ref: _cleanup(ref, handle)))
>
>     def foo(self):
>         CAPI.foo(self.handle)
>
> There are probably even better ways too, this is just the first that
> comes to mind.

Thanks for the example.

Thus, I believe my example is a good use case for __del__ with no good
enough workaround, which was requested by Micheal in the original post. I
believe that it would be a mistake to remove __del__ unless we provide a
graceful alternative (and I don't consider the code above a graceful
alternative). I still like the __close__ method being proposed. I'd love to
see a PEP for it.
-- 
Giovanni Bajo


From fredrik at pythonware.com  Wed Sep 20 15:38:59 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 20 Sep 2006 15:38:59 +0200
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <451142FF.20203@solarsail.hcs.harvard.edu>
References: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>
	<451142FF.20203@solarsail.hcs.harvard.edu>
Message-ID: <eerg9j$pj8$1@sea.gmane.org>

Ivan Krstić wrote:

> The `this` and `the matter` referred not to removing the GIL, but
> providing some form of sane multiprocessing support that doesn't require
> everyone interested in MP to reinvent the wheel.

no need to wait for Guido for this: adding library support for shared-
memory dictionaries/lists is a no-brainer.  if you have experience in 
this field, start hacking.  I'll take care of the rest ;-)

</F>


From fredrik at pythonware.com  Wed Sep 20 16:02:28 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 20 Sep 2006 16:02:28 +0200
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <eerg9j$pj8$1@sea.gmane.org>
References: <fb6fbf560609180856s1b63ded2o66061d453368a9@mail.gmail.com>	<451142FF.20203@solarsail.hcs.harvard.edu>
	<eerg9j$pj8$1@sea.gmane.org>
Message-ID: <451149F4.2040501@pythonware.com>

Fredrik Lundh wrote:

> no need to wait for Guido for this: adding library support for shared-
> memory dictionaries/lists is a no-brainer.  if you have experience in 
> this field, start hacking.  I'll take care of the rest ;-)

and no need to wait for Python 3000 either, of course -- I see no reason 
why this cannot go into some 2.X release.

</F>



From jcarlson at uci.edu  Wed Sep 20 17:50:25 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 08:50:25 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
Message-ID: <20060920083244.0817.JCARLSON@uci.edu>


"Adam Olsen" <rhamph at gmail.com> wrote:
> Before we can decide on the internal representation of our unicode
> objects, we need to decide on their external interface.  My thoughts
> so far:

I believe the only option actually up for decision is what the internal
representation of a unicode object will be.  UTF-8 that is never changed?
UTF-8 that is converted to UCS-2/4 on certain kinds of accesses?
Latin-1/UCS-2/UCS-4 depending on code point content?  Always UCS-2/4,
depending on a compiler switch?


> * Most transformation and testing methods (.lower(), .islower(), etc)
> can be copied directly from 2.x.  They require no special
> implementation to perform reasonably.

A decoding variant of these would be required if the underlying
representation of a particular string is not latin-1, ucs-2, or ucs-4.

Further, any rstrip/split/etc. methods need to scan/parse the entire
string in order to discover code point starts/ends when using a utf-*
variant as an internal encoding (except for utf-32, which has a constant
width per character).

Whether or not we choose to go with a varying internal representation 
(the latin-1/ucs-2/ucs-4 variant I have been suggesting), 


> * Indexing and slicing is the big issue.  Do we need constant-time
> integer slicing?  .find() could be changed to return a token that
> could be used as a constant-time offset.  Incrementing the token would
> have linear costs, but that's no big deal if the offsets are always
> small.

If by "constant-time integer slicing" you mean "find the start and end
memory offsets of a slice in constant time", I would say yes.

Generally, I think tokens (in unicode strings) are a waste of time and
implementation.  Giving each string a fixed-width per character allows
methods on those unicode strings to be far simpler in implementation.


> * Grapheme clusters, words, lines, other groupings, do we need/want
> ways to slice based on them too?

No.


> * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> want to support them?  Now would be the time.

This would imply a tree-based string, which Guido has specifically
stated would not happen.  Never mind that it would be a beast to
implement and maintain, or that it would exclude the possibility of
offering the single-segment buffer interface without reprocessing.

 - Josiah


From mcherm at mcherm.com  Wed Sep 20 18:27:56 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 09:27:56 -0700
Subject: [Python-3000] Removing __del__
Message-ID: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>

Giovanni Bajo writes:
> I believe my example is a good use case for __del__ with no good
enough workaround, which was requested by Michael in the original post. I
> believe that it would be a mistake to remove __del__ unless we provide a
> graceful alternative (and I don't consider the code above a graceful
> alternative). I still like the __close__ method being proposed.

Thank you!

This is exactly the kind of discussion that I was hoping to engender.
Let me see if I can make the case a little more effectively. First of
all, let clean up Jean-Paul's solution a little bit so it looks prettier
when used. Let's put the following code into a module:

----- deletions.py -----
import weakref

# Maintain a separate list so the weakrefs themselves
# won't be garbage collected.
on_del_callbacks = []


def on_del_invoke(obj, func, *args, **kwargs):
     """This sets up a callback to be executed when an object
     is finalized. It is similar to the old __del__ method but
     without some of the risks and limitations of that method.

     The first argument is an object to watch; the second is a
     callable. After the object being watched gets finalized,
     the callable will be invoked; arguments for this call can
     be provided after the callable.

     Please note that the callable must not be a bound method
     of the object being watched, and the object being watched
     must not be (or be referred to by) one of the arguments
     or else the object will never be garbage collected."""

     def callback(ref):
         on_del_callbacks.remove(ref)
         func(*args, **kwargs)

     on_del_callbacks.append(
         weakref.ref(obj, callback))
--- end deletions.py ---

Performance could be improved in minor ways (avoiding the O(n)
lookup cost in the remove() call; avoiding the need for a
separate function object for each callback; catching obvious
loops and raising an exception immediately to make it more
newbie-friendly), but this will do for discussion.
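
(For instance, swapping the list for a set makes the removal O(1);
the rest of this message sticks with the list version above:)

    on_del_callbacks = set()

    def on_del_invoke(obj, func, *args, **kwargs):
        def callback(ref):
            on_del_callbacks.discard(ref)   # O(1), unlike list.remove()
            func(*args, **kwargs)
        on_del_callbacks.add(weakref.ref(obj, callback))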

Using this, your original code:

> class Wrapper:
>     def __init__(self, *args):
>          self.handle = CAPI.init(*args)
>
>     def __del__(self, *args):
>           CAPI.close(self.handle)
>
>     def foo(self):
>           CAPI.foo(self.handle)

becomes this code:

   from deletions import on_del_invoke

   class Wrapper:
       def __init__(self, *args):
           self.handle = CAPI.init(*args)
           on_del_invoke(self, CAPI.close, self.handle)

       def foo(self):
           CAPI.foo(self.handle)

It's actually *fewer* lines this way, and I find it quite
readable. Furthermore, unlike the __del__ version it doesn't
break as soon as someone accidentally puts a Wrapper object
into a loop.

Working from this example, I'm not convinced that the price
of giving up __del__ is really all that high. (But please,
find another example to convince me!)

On the other side of the scales, here are some benefits that
we gain if we get rid of __del__:

   * Simpler GC code which is less likely to have obscure
     bugs that are incredibly difficult to track down. Less
     core developer time spent maintaining complex, fragile
     code.

   * No need to explain about keeping __del__ objects[1] out
     of reference loops. In exchange, we choose to explain
     about not passing the object being monitored or
     anything that links to it as arguments to on_del_invoke.
     I find that preferable because: (1) it seems more
     intuitive to me that the callback musn't reference the
     object being finalized, (2) it requires reasoning about
     the call-site, not about all future uses of the object,
     and (3) if the programmer violates this rule then the
     disadvantage is that the objects become immortal -- which
     is true for ALL __del__ objects in loops today.

   * Programmers no longer have the ability to allow __del__
     to resurrect the object being finalized. Technically,
     that's a disadvantage, not an advantage, but I honestly
     don't think anyone believes it's a good idea to write
     __del__ methods that resurrect the object, so I'm happy
     to lose that ability.

-- Michael Chermside


[1] - I'm using "__del__ object" to mean an object that has
    a __del__ method.


From guido at python.org  Wed Sep 20 18:40:49 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 09:40:49 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
Message-ID: <ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>

On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> Before we can decide on the internal representation of our unicode
> objects, we need to decide on their external interface.  My thoughts
> so far:

Let me cut this short. The external string API in Py3k should not
change or only very marginally so (like removing rarely used useless
APIs or adding a few new conveniences). The plan is to keep the 2.x
API that is supported (in 2.x) by both str and unicode, but merge the
two string types into one. Anything else could be done just as easily
before or after Py3k.

OTOH, if you want to start to gather requirements for the bytes API,
now is the time.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From mcherm at mcherm.com  Wed Sep 20 18:48:10 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 09:48:10 -0700
Subject: [Python-3000] Delayed reference counting idea
Message-ID: <20060920094810.2lpvw6n4pk4k44wc@login.werra.lunarpages.com>

Greg Ewing writes:
> A worry about that is whether the architecture required to
> allow pluggable GC implementations introduces inefficiencies
> of its own that would skew the results.

Bob Ippolito
> There's no need to worry about that in the case of PyPy. Those kinds
> of choices are made way before runtime, so there's no required
> indirection.


Someone who knows PyPy better than me should feel free to chime in if I
get things wrong, but I *think* that it happens well before runtime --
before compile-time, even; more like "the time at which the interpreter
itself is compiled". So if you have PyPy set up to compile to C and use
reference-counting GC, then it generates calls to INCR and DECR before
and after variable accesses, but if you have it set up to compile to
LLVM, which has its own tracing GC, then it doesn't generate anything
before and after variable accesses.


Greg again:
> Even so, we're talking about machine-generated code rather
> than the sort of hand-crafting you need to get the best
> out of something critical like GC. There could still be
> room for inefficiencies.

Quite true. As further illustration Python's GC is written in
C and thus you can't get the kind of efficiency you might out
of hand-crafted assembly. Unless of course the machine generating
the code is actually smarter about optimization than the hand
that's crafting it, or if the two are close enough in performance
that we don't mind.

I don't think PyPy has anything to teach us about GC performance
*yet*, but I think their approach is quite promising as a
platform for running this kind of experiment.

-- Michael Chermside


From jimjjewett at gmail.com  Wed Sep 20 19:09:14 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 20 Sep 2006 13:09:14 -0400
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920083244.0817.JCARLSON@uci.edu>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<20060920083244.0817.JCARLSON@uci.edu>
Message-ID: <fb6fbf560609201009x121f261dsdbcee088a9255bbf@mail.gmail.com>

On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:

> "Adam Olsen" <rhamph at gmail.com> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface.  My thoughts
> > so far:

> I believe the only options up for actual decision is what the internal
> representation of a unicode object will be.

If I request string[4:7], what (format of string) will come back?

The same format as the original string?
A canonical format?
The narrowest possible for the new string?

When a recoding occurs, is that in addition to the original format, or
instead of?  (I think "in addition" would be useful, as we're likely
to need that original format back for output -- but it does waste
space when we don't need the original again.)

> Further, any rstrip/split/etc. methods need to scan/parse the entire
> string in order to discover code point starts/ends when using a utf-*
> variant as an internal encoding (except for utf-32, which has a constant
> width per character).

No.  That is true of some encodings, but not the UTF variants.  A byte
(or double-byte, for UTF-16) is unambiguous.

Within a specific encoding, each possible (byte or double-byte) value
represents at most one of

    a complete value
    the start of a multi-position value
    the continuation of a multi-position value
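
In UTF-8 terms, for instance (a sketch; it ignores byte values that can
never appear in valid UTF-8):

    def utf8_kind(byte_value):
        if byte_value < 0x80:
            return 'complete'        # 0xxxxxxx: a whole (ASCII) code point
        if byte_value < 0xC0:
            return 'continuation'    # 10xxxxxx
        return 'start'               # 11xxxxxx: leads a multi-byte sequence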

That said, string[47:-34] may need to parse the whole string, just to
count double-position characters.  (To be honest, I'm not sure even
then; for UTF-16 it might make sense to treat surrogates as
double-width characters.  Even for UTF-8, there might be a workaround
that speeds up the majority of strings.)

> Giving each string a fixed-width per character allows
> methods on those unicode strings to be far simpler in implementation.

Which is why that was done in Py 2K.  The question for Py3K is

    Should we *commit* to this particular representation and allow
direct access to the internals?

    Or should we treat the internals as opaque, and allow more
efficient representations if someone wants to write one.

Today, I can go ahead and write my own string representation, but if I
change the internal storage, I can't actually use it with most
compiled extensions.

> > * Grapheme clusters, words, lines, other groupings, do we need/want
> > ways to slice based on them too?

> No.

I assume that you don't really mean strings will stop supporting split()

> > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > want to support them?  Now would be the time.

> This would imply a tree-based string,

Cheap slicing wouldn't.
Cheap concatenation in *all* cases would.
Cheap concatenation in a few lucky cases wouldn't.

> it would exclude the possibility for
> offering the single-segment buffer interface, without reprocessing.

I'm not sure exactly what you mean here.  If you just mean "C code
can't get at the internals without warning", then that is true.

It is also true that any function requesting the internals would need
to either get the encoding along with it, or work with bytes.

If the C code wants that buffer in a specific encoding, it will have
to request that, which might well require reprocessing.  But if so,
then this recoding already happens today -- it is just that today, we
do it for every string, instead of only the ones that need it.  (But
today, the recoding happens earlier, which can be better for
debugging.)

-jJ

From rhamph at gmail.com  Wed Sep 20 19:47:39 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 11:47:39 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920083244.0817.JCARLSON@uci.edu>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<20060920083244.0817.JCARLSON@uci.edu>
Message-ID: <aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>

On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Adam Olsen" <rhamph at gmail.com> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface.  My thoughts
> > so far:
>
> I believe the only options up for actual decision is what the internal
> representation of a unicode object will be.  Utf-8 that is never changed?
> Utf-8 that is converted to ucs-2/4 on certain kinds of accesses?
> Latin-1/ucs-2/ucs-4 depending on code point content?  Always ucs-2/4,
> depending on compiler switch?

Just a minor nit.  I doubt we could accept UCS-2, we'd want UTF-16
instead, with all the variable-width goodness that brings in.

Or maybe not so minor.  Old versions of windows used UCS-2, new
versions use UTF-16.  The former should get errors if too high of a
character is used, the latter will need conversion if we're not using
UTF-16.
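
(Concretely, "too high of a character" means anything above U+FFFF,
which UTF-16 spreads across two code units -- a sketch of the
arithmetic:)

    def to_surrogates(cp):
        assert cp > 0xFFFF               # a supplementary code point
        cp -= 0x10000
        return (0xD800 + (cp >> 10),     # high (lead) surrogate
                0xDC00 + (cp & 0x3FF))   # low (trail) surrogate

    # to_surrogates(0x1D11E) == (0xD834, 0xDD1E), MUSICAL SYMBOL G CLEF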


> > * Most transformation and testing methods (.lower(), .islower(), etc)
> > can be copied directly from 2.x.  They require no special
> > implementation to perform reasonably.
>
> A decoding variant of these would be required if the underlying
> representation of a particular string is not latin-1, ucs-2, or ucs-4.

That makes no sense.  They can operate on any encoding we design them
to.  The cost is always O(n) with the length of the string.


> Further, any rstrip/split/etc. methods need to scan/parse the entire
> string in order to discover code point starts/ends when using a utf-*
> variant as an internal encoding (except for utf-32, which has a constant
> width per character).

See below.


> Whether or not we choose to go with a varying internal representation
> (the latin-1/ucs-2/ucs-4 variant I have been suggesting),
>
>
> > * Indexing and slicing is the big issue.  Do we need constant-time
> > integer slicing?  .find() could be changed to return a token that
> > could be used as a constant-time offset.  Incrementing the token would
> > have linear costs, but that's no big deal if the offsets are always
> > small.
>
> If by "constant-time integer slicing" you mean "find the start and end
> memory offsets of a slice in constant time", I would say yes.
>
> Generally, I think tokens (in unicode strings) are a waste of time and
> implementation.  Giving each string a fixed-width per character allows
> methods on those unicode strings to be far simpler in implementation.

s = 'foobar'
p = (s[s.find('bar'):] == 'bar')

Even if .find() is made to return a token, rather than an integer, the
behavior and performance of this example are unchanged.

However, I can imagine there might be use cases, such as the .find()
output on one string being used to slice a different string, which
tokens wouldn't support.  I haven't been able to dream up any sane
examples, which is why I asked about it here.  I want to see specific
examples showing that tokens won't work.

Using only utf-8 would be simpler than three distinct representations.
And if memory usage is an issue (which it seems to be, albeit in a
vague way), we could make a custom encoding that's even simpler and
more space-efficient than utf-8.


> > * Grapheme clusters, words, lines, other groupings, do we need/want
> > ways to slice based on them too?
>
> No.

Can you explain your reasoning?


> > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > want to support them?  Now would be the time.
>
> This would imply a tree-based string, which Guido has specifically
> stated would not happen.  Never mind that it would be a beast to
> implement and maintain or that it would exclude the possibility for
> offering the single-segment buffer interface, without reprocessing.

The only reference I found was this:
http://mail.python.org/pipermail/python-3000/2006-August/003334.html

I interpret that as him being very sceptical, not an outright refusal.

Allowing external code to operate on a python string in-place seems
tenuous at best.  Even with three types (Latin-1, UCS-2, UCS-4) you
would need to automatically copy and convert if the wrong type is
given.

-- 
Adam Olsen, aka Rhamphoryncus

From rhamph at gmail.com  Wed Sep 20 20:20:13 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 12:20:13 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
Message-ID: <aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > Before we can decide on the internal representation of our unicode
> > objects, we need to decide on their external interface.  My thoughts
> > so far:
>
> Let me cut this short. The external string API in Py3k should not
> change or only very marginally so (like removing rarely used useless
> APIs or adding a few new conveniences). The plan is to keep the 2.x
> API that is supported (in 2.x) by both str and unicode, but merge the
two string types into one. Anything else could be done just as easily
> before or after Py3k.

Thanks, but one thing remains unclear: is the indexing intended to
represent bytes, code points, or code units?  Note that C code
operating on UTF-16 would use code units for slicing of UTF-16, which
splits surrogate pairs.

As far as I can tell, CPython on Windows uses UTF-16 with code units.
Perhaps not intentionally, but by default (not throwing an error on
surrogates).

For those trying to make sense of this, a Code Point is anything in the 0
to 0x10FFFF range.  A Code Unit goes up to 0xFF for UTF-8, 0xFFFF for
UTF-16, and 0xFFFFFFFF for UTF-32.  One or more code units may be
needed to form a single code point.  Obviously code units expose our
internal implementation choice.
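
For example (illustration only; any code point above U+FFFF will do):

    s = u'\U00010143'   # a single code point outside the BMP
    # narrow (UTF-16) build: len(s) == 2 -- a surrogate pair of code units
    # wide (UCS-4) build:    len(s) == 1 -- one code unit per code point
    print(len(s))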

-- 
Adam Olsen, aka Rhamphoryncus

From brett at python.org  Wed Sep 20 20:30:28 2006
From: brett at python.org (Brett Cannon)
Date: Wed, 20 Sep 2006 11:30:28 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
Message-ID: <bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>

On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
>
> On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface.  My thoughts
> > > so far:
> >
> > Let me cut this short. The external string API in Py3k should not
> > change or only very marginally so (like removing rarely used useless
> > APIs or adding a few new conveniences). The plan is to keep the 2.x
> > API that is supported (in 2.x) by both str and unicode, but merge the
> two string types into one. Anything else could be done just as easily
> > before or after Py3k.
>
> Thanks, but one thing remains unclear: is the indexing intended to
> represent bytes, code points, or code units?  Note that C code
> operating on UTF-16 would use code units for slicing of UTF-16, which
> splits surrogate pairs.


Assuming my Unicode lingo is right and code point represents a
letter/character/digraph/whatever, then it will be a code point.  Doing one
of my rare channelings of Guido, I *really* doubt he wants to expose the
technical details of Unicode to the point of having people need to realize
that UTF-8 takes two bytes to represent "é".  If you want that kind of
exposure, use the bytes type.  Otherwise assume the usage will be by people
ignorant of Unicode who thus want something that will work the way they are
used to from working in ASCII.

-Brett

From guido at python.org  Wed Sep 20 20:32:04 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 11:32:04 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
Message-ID: <ca471dc20609201132h2828382bt5f47a5d62c6b2d92@mail.gmail.com>

On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface.  My thoughts
> > > so far:
> >
> > Let me cut this short. The external string API in Py3k should not
> > change or only very marginally so (like removing rarely used useless
> > APIs or adding a few new conveniences). The plan is to keep the 2.x
> > API that is supported (in 2.x) by both str and unicode, but merge the
> > two string types into one. Anything else could be done just as easily
> > before or after Py3k.
>
> Thanks, but one thing remains unclear: is the indexing intended to
> represent bytes, code points, or code units?

I don't see what's unclear -- the existing unicode object does what it does.

> Note that C code
> operating on UTF-16 would use code units for slicing of UTF-16, which
> splits surrogate pairs.

I thought we were discussing the Python API.

C code will likely have the same access to unicode objects as it has in 2.x.

> As far as I can tell, CPython on windows uses UTF-16 with code units.
> Perhaps not intentionally, but by default (not throwing an error on
> surrogates).

This is intentional, to be compatible with the rest of that platform.
Jython and IronPython do this too I believe.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rhamph at gmail.com  Wed Sep 20 20:43:03 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 12:43:03 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201132h2828382bt5f47a5d62c6b2d92@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
	<ca471dc20609201132h2828382bt5f47a5d62c6b2d92@mail.gmail.com>
Message-ID: <aac2c7cb0609201143u29398cbbr247de08db5d1fcbb@mail.gmail.com>

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> > > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > > > Before we can decide on the internal representation of our unicode
> > > > objects, we need to decide on their external interface.  My thoughts
> > > > so far:
> > >
> > > Let me cut this short. The external string API in Py3k should not
> > > change or only very marginally so (like removing rarely used useless
> > > APIs or adding a few new conveniences). The plan is to keep the 2.x
> > > API that is supported (in 2.x) by both str and unicode, but merge the
> > > two string types into one. Anything else could be done just as easily
> > > before or after Py3k.
> >
> > Thanks, but one thing remains unclear: is the indexing intended to
> > represent bytes, code points, or code units?
>
> I don't see what's unclear -- the existing unicode object does what it does.

The existing unicode object doesn't expose the difference between them
except when UTF-16 is used and surrogates exist.


> > Note that C code
> > operating on UTF-16 would use code units for slicing of UTF-16, which
> > splits surrogate pairs.
>
> I thought we were discussing the Python API.
>
> C code will likely have the same access to unicode objects as it has in 2.x.

I only mentioned it because C doesn't mind exposing the internal
details for performance benefits, whereas python usually does mind.


> > As far as I can tell, CPython on windows uses UTF-16 with code units.
> > Perhaps not intentionally, but by default (not throwing an error on
> > surrogates).
>
> This is intentional, to be compatible with the rest of that platform.
> Jython and IronPython do this too I believe.

So you're saying we should use code units?!  Or are you referring to
the choice of UTF-16?

I would expect us to use code points in 3.x, but that's not how it is in 2.x.

-- 
Adam Olsen, aka Rhamphoryncus

From jimjjewett at gmail.com  Wed Sep 20 21:04:23 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 20 Sep 2006 15:04:23 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
Message-ID: <fb6fbf560609201204u2cb14ba1qb15854fd8444bc98@mail.gmail.com>

On 9/20/06, Michael Chermside <mcherm at mcherm.com> wrote:
> Giovanni Bajo writes:
> > I believe my example is a good use case for __del__ with no good
> > enough workaround, ... I still like the __close__ method being proposed.

[Michael asks about this alternative]
...
> def on_del_invoke(obj, func, *args, **kwargs):
...
>      Please note that the callable must not be a bound method
>      of the object being watched, and the object being watched
>      must not be (or be referred to by) one of the arguments
>      or else the object will never be garbage collected."""

By far the most frequently desired callable is self.close.

You can work around this with a wrapper, by setting self.f=open(...)
and then passing self.f.close -- but with this API, I'll be wondering
why I can't just register self.f as the object in the first place.

If bound methods did not increment the refcount, this would work, but
I imagine it would break various GUI and event-processing idioms.

A special rebind-this-method-weakly builtin would work, but I'm not
sure that is any simpler than __close__.  (~= __del__ but cycles can
be broken in an arbitrary order)

> Using this, your original code:

> > class Wrapper:
> >     def __init__(self, *args):
> >          self.handle = CAPI.init(*args)

> >     def __del__(self, *args):
> >           CAPI.close(self.handle)

> >     def foo(self):
> >           CAPI.foo(self.handle)

> becomes this code:

>    from deletions import on_del_invoke

>    class Wrapper:
>        def __init__(self, *args):
>            self.handle = CAPI.init(*args)
>            on_del_invoke(self, CAPI.close, self.handle)

>        def foo(self):
>            CAPI.foo(self.handle)

Note that the wrapper (as posted) does nothing except store a pointer
to the CAPI object and then delegate to it.  With a __close__ method,
this class could reduce to (at most)

    class MyCAPI(CAPI):
        __close__ = close

Since the CAPI class could use the __close__ convention directly, the
wrapper could be eliminated entirely.  (In real life, his class might
do more ... but if so, then *these* lines are still boilerplate that
it would be good to remove).

> On the other side of the scales, here are some benefits that
> we gain if we get rid of __del__:

>    * No need to explain about keeping __del__ objects[1] out
>      of reference loops. In exchange, we choose to explain
>      about not passing the object being monitored or
>      anything that links to it as arguments to on_del_invoke.

Adding an extra wrapper just to avoid passing self isn't really any
better than adding an extra cleanup object hanging off an attribute to
avoid loops.  So the explanation might be better, but the resulting
code would end up using the same workarounds that are recommended (but
often not used) today.

>     (3) if the programmer violates this rule then the
>      disadvantage is that the objects become immortal -- which
>      is true for ALL __del__ objects in loops today.

But most objects are not in a __del__ loop.  By passing a bound
method, the user makes the object immortal even if it is the only
object that needs cleanup.

>    * Programmers no longer have the ability to allow __del__
>      to resurrect the object being finalized. Technically,
>      that's a disadvantage, not an advantage, but I honestly
>      don't think anyone believes it's a good idea to write
>      __del__ methods that resurrect the object, so I'm happy
>      to lose that ability.

How do you feel about the __del__ in stdlib subprocess.Popen (about line 615)?

This resurrects itself, in order to finish waiting for the child
process.  If the child isn't done yet, then it will check again the
next time a new Popen is created (or at final closedown).  Without
this ability to reschedule itself, it would have to do a blocking
wait, which might put some odd pressures on concurrency.

(And note that if it needed to revive (not recreate, revive)
subobjects, it would need the full immortal-cycle power of today's
__del__.  It may be valid not to support this case, but it isn't
automatically bad usage.)
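
A toy sketch of that pattern (not the actual subprocess code; _active
here is a hypothetical module-level registry):

    _active = []   # objects still awaiting cleanup

    class Handle(object):
        def _poll(self):
            # stand-in for a non-blocking wait; True once the child is done
            return False

        def __del__(self):
            if not self._poll():
                # resurrect: keep a new reference so cleanup can be
                # retried later, instead of blocking here.  (2.x will
                # call __del__ again when the refcount next drops.)
                _active.append(self)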

-jJ

From jimjjewett at gmail.com  Wed Sep 20 22:59:22 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Wed, 20 Sep 2006 16:59:22 -0400
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201132h2828382bt5f47a5d62c6b2d92@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
	<ca471dc20609201132h2828382bt5f47a5d62c6b2d92@mail.gmail.com>
Message-ID: <fb6fbf560609201359v368ac15ey189165db957e9dff@mail.gmail.com>

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > On 9/20/06, Guido van Rossum <guido at python.org> wrote:

> > > Let me cut this short. The external string API in Py3k should not
> > > change or only very marginally so (like removing rarely used useless
> > > APIs or adding a few new conveniences).

...

> I thought we were discussing the Python API.

I don't think anyone has proposed much change to strings *as seen from
python*.

At most, there has been an implicit suggestion that the
bytes.decode().encode() dance be shortened.

> C code will likely have the same access to unicode objects as it has in 2.x.

Can C code still assume that

     (1)  the data buffer will always be available for any sort of
direct manipulation (including mutation)

     (2)  in a specific canonical encoding

     (3)  directly from the memory layout, without calling a "prepare"
or "recode" or "encode" method first?

Today, that canonical encoding is a compile-time choice, and any
specific choice causes integration hassles.

Unless the choice matches the system default for text, it also
requires many decode/encode round trips that might otherwise be
avoided.

The proposed changes mostly boil down to removing the third
assumption, and agreeing that some implementations might delay the
decode-to-canonical-format until it was needed.


Rough Summary of new C API restrictions:

Replace
    ((PyStringObject *)string)->ob_sval   /* supported today */
with
    PyString_AsString(string)                 /* already recommended */

or replace
    ((PyUnicodeObject *)string)->str       /* supported today */
and
    ((PyUnicodeObject *)string)->defenc    /* supported today */

with
    PyUnicode_AsEncodedString(PyObject *unicode,   /* already recommended */
                              const char *encoding,
                              const char *errors)
and
    PyUnicode_AsAnyString(PyObject *unicode,      /* new */
                          char **encoding,   /* return the actual encoding */
                          const char *errors)

Also note that some macros would need to become functions.  The most
prominent is

    PyUnicode_AS_DATA(string)         /* supports mutation */

-jJ

From jcarlson at uci.edu  Wed Sep 20 23:20:22 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 14:20:22 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <fb6fbf560609201009x121f261dsdbcee088a9255bbf@mail.gmail.com>
References: <20060920083244.0817.JCARLSON@uci.edu>
	<fb6fbf560609201009x121f261dsdbcee088a9255bbf@mail.gmail.com>
Message-ID: <20060920135601.0822.JCARLSON@uci.edu>


"Jim Jewett" <jimjjewett at gmail.com> wrote:
> On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> 
> > "Adam Olsen" <rhamph at gmail.com> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface.  My thoughts
> > > so far:
> 
> > I believe the only options up for actual decision is what the internal
> > representation of a unicode object will be.
> 
> If I request string[4:7], what (format of string) will come back?
> 
> The same format as the original string?
> A canonical format?
> The narrowest possible for the new string?

Which of the three depends on the choice of internal representation.  If
the internal representation is always canonical, narrowest, or same as
the original string, then it would be one of those.


> When a recoding occurs, is that in addition to the original format, or
> instead of?  (I think "in addition" would be useful, as we're likely
> to need that original format back for output -- but it does waste
> space when we don't need the original again.)

The current implementation, I believe, uses "in addition", unless I'm
misreading the unicode string struct.


> > Further, any rstrip/split/etc. methods need to scan/parse the entire
> > string in order to discover code point starts/ends when using a utf-*
> > variant as an internal encoding (except for utf-32, which has a constant
> > width per character).
> 
> No.  That is true of some encodings, but not the UTF variants.  A byte
> (or double-byte, for UTF-16) is unambiguous.

I was under the impression that utf-8 was a particular kind of prefix
encoding.  Looking at the actual output of utf-8, I notice that the
encodings are such that bytes with value >= 0xc0 mark the beginning of
the multi-byte sequences, so handling 'from the front' or 'from the
back' is equally reasonable.
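
A quick sketch of that "from the back" scan (hypothetical helper):

    def codepoint_start(buf, pos):
        # continuation bytes look like 10xxxxxx; anything else (ASCII,
        # or a lead byte >= 0xc0) starts a code point
        while pos > 0 and (ord(buf[pos:pos+1]) & 0xC0) == 0x80:
            pos -= 1
        return pos

    buf = u'na\xefve'.encode('utf-8')   # \xef encodes to two bytes in utf-8
    assert codepoint_start(buf, 3) == 2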


> That said, string[47:-34] may need to parse the whole string, just to
> count double-position characters.  (To be honest, I'm not sure even
> then; for UTF-16 it might make sense to treat surrogates as
> double-width characters.  Even for UTF-8, there might be a workaround
> that speeds up the majority of strings.)

It would involve keeping some sort of cache of indices/offset values. 
This may not be worthwhile.


> > Giving each string a fixed-width per character allows
> > methods on those unicode strings to be far simpler in implementation.
> 
> Which is why that was done in Py 2K.  The question for Py3K is
> 
>     Should we *commit* to this particular representation and allow
> direct access to the internals?

Why not?

>     Or should we treat the internals as opaque, and allow more
> efficient representations if someone wants to write one?

I'm not sure that the efficiencies are necessarily desireable.

> Today, I can go ahead and write my own string representation, but if I
> change the internal storage, I can't actually use it with most
> compiled extensions.

Right, but extensions that are used *right now* would need to be
rewritten to handle these "more efficient" representations.


> > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > ways to slice based on them too?
> 
> > No.
> 
> I assume that you don't really mean strings will stop supporting split().

That would be silly.  What I meant was that text.word[7], text.line[3],
etc., shouldn't mean anything on the base implementation.


> > > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > > want to support them?  Now would be the time.
> 
> > This would imply a tree-based string,
> 
> Cheap slicing wouldn't.

O(log n) would imply a tree-based string.  O(1) would imply slicing on
text returning views (which I'm not even advocating, and I'm a view
proponent).

> Cheap concatenation in *all* cases would.
> Cheap concatenation in a few lucky cases wouldn't.

Presumably one would need to copy data from one to the other, so that
would be O(n) with a non-tree version.


> > it would exclude the possibility for
> > offering the single-segment buffer interface, without reprocessing.
> 
> I'm not sure exactly what you mean here.  If you just mean "C code
> can't get at the internals without warning", then that is true.

The single-segment buffer interface is, not uncommonly, how C extensions
get at the content of strings, unicode, array, mmap, etc.  Technically
speaking, the current implementations of str and unicode use an internal
variant to gain access to their own internals for processing.


> It is also true that any function requesting the internals would need
> to either get the encoding along with it, or work with bytes.

Or code points...  The point of specifying the character width as 1, 2,
or 4 bytes would be that one can iterate over chars, shorts, or ints.


> If the C code wants that buffer in a specific encoding, it will have
> to request that, which might well require reprocessing.  But if so,
> then this recoding already happens today -- it is just that today, we
> do it for every string, instead of only the ones that need it.  (But
> today, the recoding happens earlier, which can be better for
> debugging.)

Indeed.  But it's not just for C extensions, it's for Python's own
string/unicode internals.  Simple is better than complex.  Having a flat
array-based implementation is simple, and allows us to re-use the vast
majority of code we already have.

 - Josiah


From jcarlson at uci.edu  Wed Sep 20 23:59:22 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 14:59:22 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>
References: <20060920083244.0817.JCARLSON@uci.edu>
	<aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>
Message-ID: <20060920142332.0825.JCARLSON@uci.edu>


"Adam Olsen" <rhamph at gmail.com> wrote:
> 
> On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Adam Olsen" <rhamph at gmail.com> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface.  My thoughts
> > > so far:
> >
> > I believe the only options up for actual decision is what the internal
> > representation of a unicode object will be.  Utf-8 that is never changed?
> > Utf-8 that is converted to ucs-2/4 on certain kinds of accesses?
> > Latin-1/ucs-2/ucs-4 depending on code point content?  Always ucs-2/4,
> > depending on compiler switch?
> 
> Just a minor nit.  I doubt we could accept UCS-2; we'd want UTF-16
> instead, with all the variable-width goodness that brings in.

If we are opting for a *single* internal representation, then UTF-16 or
UTF-32 are really the only options.

> > > * Most transformation and testing methods (.lower(), .islower(), etc)
> > > can be copied directly from 2.x.  They require no special
> > > implementation to perform reasonably.
> >
> > A decoding variant of these would be required if the underlying
> > representation of a particular string is not latin-1, ucs-2, or ucs-4.
> 
> That makes no sense.  They can operate on any encoding we design them
> to.  The cost is always O(n) with the length of the string.

I was thinking .startswith() and .endswith(), but assuming *some*
canonical representation (UTF-16, UTF-32, etc.) this is trivial to
implement.  I take back my concerns on this particular point.


> > Whether or not we choose to go with a varying internal representation
> > (the latin-1/ucs-2/ucs-4 variant I have been suggesting),
> >
> >
> > > * Indexing and slicing is the big issue.  Do we need constant-time
> > > integer slicing?  .find() could be changed to return a token that
> > > could be used as a constant-time offset.  Incrementing the token would
> > > have linear costs, but that's no big deal if the offsets are always
> > > small.
> >
> > If by "constant-time integer slicing" you mean "find the start and end
> > memory offsets of a slice in constant time", I would say yes.
> >
> > Generally, I think tokens (in unicode strings) are a waste of time and
> > implementation.  Giving each string a fixed-width per character allows
> > methods on those unicode strings to be far simpler in implementation.
> 
> However, I can imagine there might be use cases, such as the .find()
> output on one string being used to slice a different string, which
> tokens wouldn't support.  I haven't been able to dream up any sane
> examples, which is why I asked about it here.  I want to see specific
> examples showing that tokens won't work.

    p = s[6:-6]

Or even in actual code I use today:

    p = s.lstrip()
    lil = len(s) - len(p)
    si = s[:lil]
    lil += si.count('\t')*(self.GetTabWidth()-1)
    
    #s is the original line
    #p is the line without leading indentation
    #si is the line indentation characters
    #lil is the indentation of the line in columns

If I can't slice based on character index, then we end up with a similar
situation that the wxPython StyledTextCtrl runs into right now: the
content is encoded via utf-8 internally, so users have to use the fairly
annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
where characters start/end.  While it is possible to handle everything
this way, it is *damn annoying*, and some users have gone so far as to
say that it *doesn't work* for Europeans.

While I won't make the claim that it *doesn't work*, it is a pain in the
ass.


> Using only utf-8 would be simpler than three distinct representations.
>  And if memory usage is an issue (which it seems to be, albeit in a
> vague way), we could make a custom encoding that's even simpler and
> more space efficient than utf-8.

One of the reasons I've been pushing for the 3 representations is
because it is (arguably) optimal for any particular string.


> > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > ways to slice based on them too?
> >
> > No.
> 
> Can you explain your reasoning?

We can already split based on words, lines, etc., using split() and
re.split().  Building additional functionality for text.word[4] seems to
be a waste of time.


> > > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > > want to support them?  Now would be the time.
> >
> > This would imply a tree-based string, which Guido has specifically
> > stated would not happen.  Never mind that it would be a beast to
> > implement and maintain or that it would exclude the possibility for
> > offering the single-segment buffer interface, without reprocessing.
> 
> The only reference I found was this:
> http://mail.python.org/pipermail/python-3000/2006-August/003334.html
> 
> I interpret that as him being very sceptical, not an outright refusal.
> 
> Allowing external code to operate on a python string in-place seems
> tenuous at best.  Even with three types (Latin-1, UCS-2, UCS-4) you
> would need to automatically copy and convert if the wrong type is
> given.

The only benefits that utf-8 gains over any other internal
representation are that it is an arguably minimal-sized representation
and that it is commonly used among other C libraries.

The benefits gained by using the three internal representations are
primarily from a simplicity standpoint.  That is to say, when
manipulating any one of the three representations, you know that the
value at offset X represents the code point of character X in the string.
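
As a sketch of the selection rule (a hypothetical helper, not part of
any concrete proposal):

    def choose_width(text):
        # pick the narrowest fixed width that holds every code point
        highest = max([ord(c) for c in text] + [0])
        if highest < 0x100:
            return 1   # latin-1
        elif highest < 0x10000:
            return 2   # ucs-2
        else:
            return 4   # ucs-4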

Further, with a slight change in how the single-segment buffer interface
is defined (returns the width of the character), C extensions that want
to deal with unicode strings in *native* format (due to concerns about
speed), could do so without having to worry about reencoding,
variable-width characters, etc.

You can get this same behavior by always using UTF-32 (aka UCS-4), but
at least 1/4 of the underlying data is always going to be nulls (code
points are limited to 0x0010ffff), and for many people (in Europe, the
US, and anywhere else with code points < 65536), 1/2 to 3/4 of the
underlying data is going to be nulls.

While I would imagine that people could deal with UTF-16 as an
underlying representation (from a data waste perspective), the potential
for varying-width characters in such an encoding is a pain in the ass
(like it is for UTF-8).

Regardless of our choice, *some platform* is going to be angry.  Why?
GTK takes utf-8 encoded strings.  (I don't know what Qt or Linux system
calls take.)  Windows takes utf-16.  Whatever the underlying
representation, *someone* is going to have to recode when dealing with
GUI or OS-level operations.


 - Josiah


From qrczak at knm.org.pl  Thu Sep 21 00:34:40 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 00:34:40 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920142332.0825.JCARLSON@uci.edu> (Josiah Carlson's
	message of "Wed, 20 Sep 2006 14:59:22 -0700")
References: <20060920083244.0817.JCARLSON@uci.edu>
	<aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>
	<20060920142332.0825.JCARLSON@uci.edu>
Message-ID: <87ac4ui2v3.fsf@qrnik.zagroda>

Josiah Carlson <jcarlson at uci.edu> writes:

> Regardless of our choice, *some platform* is going to be angry.  Why? 
> GTK takes utf-8 encoded strings.  (I don't know what Qt or linux system
> calls take) Windows takes utf-16.

The representation of QChar in Qt-3.3.5:

    ushort ucs;
#if defined(QT_QSTRING_UCS_4)
    ushort grp;
#endif

The representation of QStringData in Qt-3.3.5:

    QChar *unicode;
    char *ascii;
#ifdef Q_OS_MAC9
    uint len;
#else
    uint len : 30;
#endif
    uint issimpletext : 1;
#ifdef Q_OS_MAC9
    uint maxl;
#else
    uint maxl : 30;
#endif
   uint islatin1 : 1;

I would say that it's silly. It seems the transition from UCS-2 to UCS-4
in Qt is incomplete. Almost no code is prepared for QT_QSTRING_UCS_4.
For example the implementation of a function which explains what
issimpletext means:

void QString::checkSimpleText() const
{
    QChar *p = d->unicode;
    QChar *end = p + d->len;
    while ( p < end ) {
        ushort uc = p->unicode();
        // sort out regions of complex text formatting
        if ( uc > 0x058f && ( uc < 0x1100 || uc > 0xfb0f ) ) {
            d->issimpletext = FALSE;
            return;
        }
        p++;
    }
    d->issimpletext = TRUE;
}

QChar documentation says:

   Unicode  characters are (so far) 16-bit entities without any markup or
   structure. This class represents such an entity. It is lightweight, so
   it can be used everywhere. Most compilers treat it like a "short int".
   (In  a  few  years  it may be necessary to make QChar 32-bit when more
   than 65536 Unicode code points have been defined and come into use.)

Bleh...

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From rasky at develer.com  Thu Sep 21 00:39:48 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Thu, 21 Sep 2006 00:39:48 +0200
Subject: [Python-3000] Removing __del__
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
Message-ID: <000f01c6dd05$ae08b8e0$7b4b2597@bagio>

Michael Chermside <mcherm at mcherm.com> wrote:

>    from deletions import on_del_invoke
>
>    class Wrapper:
>        def __init__(self, *args):
>            self.handle = CAPI.init(*args)
>            on_del_invoke(self, CAPI.close, self.handle)
>
>        def foo(self):
>            CAPI.foo(self.handle)
>
> It's actually *fewer* lines this way, and I find it quite
> readable.

It's fewer lines, but *less* readable than a simple plain method call. It's
still an indirection.

> Furthermore, unlike the __del__ version it doesn't
> break as soon as someone accidentally puts a Wrapper object
> into a loop.

Yes, but I'm an adult and I know that it won't. I'm not even touching __del__
with a hundred-foot pole if it's a class which has a 1% chance of getting
into a loop, really. I know it will always be a "leaf" class, if you know what
I mean.

> Working from this example, I'm not convinced that the price
> of giving up __del__ is really all that high.

If you ask me, I don't think I can find any library solution to finalization
acceptable. Finalization is really something that ought to be easy. If the
cyclic GC and __del__ don't get along well together, let's substitute __del__
with another finalization feature in the core, with an easy syntax and
semantics, which can cope better with the cyclic GC. Again, I vote for the
__close__ method (which is: just fix the semantics).

> (But please,
> find another example to convince me!)

Let's say:

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)

    def close(self):
        if self.handle is not None:
            CAPI.close(self.handle)
            self.handle = None
    __del__ = close

Now what, remove_on_del_invocation()?

> On the other side of the scales, here are some benefits that
> we gain if we get rid of __del__:
>
>    * Simpler GC code which is less likely to have obscure
>      bugs that are incredibly difficult to track down. Less
>      core developer time spent maintaining complex, fragile
>      code.

This is an argument against the current semantic of __del__, not against any
finalization method which is invoked during the cyclic GC. I believe that
__close__ fixes these problems as well.

>    * No need to explain about keeping __del__ objects[1] out
>      of reference loops. In exchange, we choose to explain
>      about not passing the object being monitored or
>      anything that links to it as arguments to on_del_invoke.
>      I find that preferable because: [...]

I think you are right in that the latter is preferable, but I think it's even
easier to just "avoid __del__ when coding, unless you are dramatically sure of
what you're doing". This way, you don't have to keep mental reference counts.

In fact, I believe we're missing a valuable tool for Python 2. Wouldn't it be
possible to have a debug mode where, between each statement (or very often at
least), Python looks for cycles with __del__ in them, and aborts execution? It
would be very useful to detect uncollectable cycles early, at the moment they
are created, instead of doing long sessions trying to parse gc.garbage.
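
A crude approximation is already possible with today's gc module (a
sketch; in 2.x, cycles whose objects define __del__ are parked in
gc.garbage rather than freed):

    import gc

    def assert_no_del_cycles():
        gc.collect()
        if gc.garbage:
            raise AssertionError('uncollectable: %r' % gc.garbage)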

Giovanni Bajo


From mcherm at mcherm.com  Thu Sep 21 00:41:15 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 15:41:15 -0700
Subject: [Python-3000] How will unicode get used?
Message-ID: <20060920154115.4wy2hnw6cnc4gkgw@login.werra.lunarpages.com>

Guido writes:
> > As far as I can tell, CPython on windows uses UTF-16 with code units.
> > Perhaps not intentionally, but by default (not throwing an error on
> > surrogates).
>
> This is intentional, to be compatible with the rest of that platform.
> Jython and IronPython do this too I believe.

The following code illustrates this:

>>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>>> msg[35:-18]
u'"\U00010143"'
>>> greek_five = msg[36:-19]
>>> len(greek_five)
2
>>> greek_five[0]
u'\ud800'
>>> greek_five[1]
u'\udd43'

The single unicode character greek_five, when expressed as a string
in CPython, has a length of 2 and can be sliced into two separate
characters. In Jython, the code above will not work because Jython
doesn't currently support \U or extended unicode (but someday that
may change). I'm not sure about IronPython.

So if I understand Guido's point, he's saying that it is on purpose
that len(greek_five) == 2. That's useful for compatibility today
with the Java and Microsoft VM platforms. But it's not particularly
compatible with extended Unicode. (Technically it doesn't violate
any rules so long as it's clearly defined that a character in Python
is NOT the same as a unicode code point.)

I wonder if it would be better to say that len(greek_five) is
undefined in Python. (And obviously slicing behavior follows from
len behavior.) There are excellent reasons for CPython to return
2 in the near future, but the far future is less clear. And
Jython and IronPython will be constrained by common sense to do
whatever their underlying platforms do, even if that changes in
the future.

Designing these things would be a lot easier if we had a time
machine so we could go see how extended Unicode is used in practice
a decade or two from now.

Oh, wait....

-- Michael Chermside


From mcherm at mcherm.com  Thu Sep 21 00:46:55 2006
From: mcherm at mcherm.com (Michael Chermside)
Date: Wed, 20 Sep 2006 15:46:55 -0700
Subject: [Python-3000] How will unicode get used?
Message-ID: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>

I wrote:
>>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>>> msg[35:-18]
u'"\U00010143"'
>>> greek_five = msg[36:-19]
>>> len(greek_five)
2


After posting, I realized that it's worse than that. I suspect that if
I tried this on a CPython compiled with wide characters, then
len(greek_five) would be 1.

What should it be? 2? 1? Implementation-dependent?

-- Michael Chermside


From guido at python.org  Thu Sep 21 00:52:16 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 15:52:16 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
Message-ID: <ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>

On 9/20/06, Michael Chermside <mcherm at mcherm.com> wrote:
> I wrote:
> >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
> >>> msg[35:-18]
> u'"\U00010143"'
> >>> greek_five = msg[36:-19]
> >>> len(greek_five)
> 2
>
>
> After posting, I realized that it's worse than that. I suspect that if
> I tried this on a CPython compiled with wide characters, then
> len(greek_five) would be 1.
>
> What should it be? 2? 1? Implementation-dependent?

This has all been rehashed endlessly. It's implementation (and
platform- and compilation options-) dependent because there are good
reasons for both choices. Even if CPython 3.0 supports a dynamic
choice (which some are proposing) then the *language* will still make
it implementation dependent because of Jython and IronPython, where
the only choice is UTF-16 (or UCS-2, depending on the attitude towards
surrogates).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rhamph at gmail.com  Thu Sep 21 00:52:38 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 16:52:38 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060920142332.0825.JCARLSON@uci.edu>
References: <20060920083244.0817.JCARLSON@uci.edu>
	<aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>
	<20060920142332.0825.JCARLSON@uci.edu>
Message-ID: <aac2c7cb0609201552q605c20e0od3df500d4a13dff0@mail.gmail.com>

On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Adam Olsen" <rhamph at gmail.com> wrote:
> >
> > On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > >
> > > "Adam Olsen" <rhamph at gmail.com> wrote:

[snip token stuff]

Withdrawn.  Blake Winston pointed me to some problems in private as well.


> If I can't slice based on character index, then we end up with a similar
> situation that the wxPython StyledTextCtrl runs into right now: the
> content is encoded via utf-8 internally, so users have to use the fairly
> annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
> where characters start/end.  While it is possible to handle everything
> this way, it is *damn annoying*, and some users have gone so far as to
> say that it *doesn't work* for Europeans.
>
> While I won't make the claim that it *doesn't work*, it is a pain in the
> ass.

I'm going to agree with you.  That's also why I'm going to assume
Guido meant to use Code Points, not Code Units (which would be bytes
in the case of UTF-8).


> > Using only utf-8 would be simpler than three distinct representations.
> >  And if memory usage is an issue (which it seems to be, albeit in a
> > vague way), we could make a custom encoding that's even simpler and
> > more space efficient than utf-8.
>
> One of the reasons I've been pushing for the 3 representations is
> because it is (arguably) optimal for any particular string.

It bothers me that adding a single character would cause it to double
or quadruple in size.  May be the best compromise though.


> > > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > > ways to slice based on them too?
> > >
> > > No.
> >
> > Can you explain your reasoning?
>
> We can already split based on words, lines, etc., using split() and
> re.split().  Building additional functionality for text.word[4] seems to
> be a waste of time.

I'm not entirely convinced, but I'll leave it for now.  Maybe it'll be
a 3.1 feature.


> The benefits gained by using the three internal representations are
> primarily from a simplicity standpoint.  That is to say, when
> manipulating any one of the three representations, you know that the
> value at offset X represents the code point of character X in the string.
>
> Further, with a slight change in how the single-segment buffer interface
> is defined (returns the width of the character), C extensions that want
> to deal with unicode strings in *native* format (due to concerns about
> speed), could do so without having to worry about reencoding,
> variable-width characters, etc.

Is it really worthwhile if there's three different formats they'd have
to handle?


> You can get this same behavior by always using UTF-32 (aka UCS-4), but
> at least 1/4 of the underlying data is always going to be nulls (code
> points are limited to 0x0010ffff), and for many people (in Europe, the
> US, and anywhere else with code points < 65536), 1/2 to 3/4 of the
> underlying data is going to be nulls.
>
> While I would imagine that people could deal with UTF-16 as an
> underlying representation (from a data waste perspective), the potential
> for varying-width characters in such an encoding is a pain in the ass
> (like it is for UTF-8).
>
> Regardless of our choice, *some platform* is going to be angry.  Why?
> GTK takes utf-8 encoded strings.  (I don't know what Qt or linux system
> calls take) Windows takes utf-16. Whatever underlying representation,
> *someone* is going to have to recode when dealing with GUI or OS-level
> operations.

Indeed, it seems like all our options are lose-lose.

Just to summarize, our requirements are:
* Full unicode range (0 through 0x10FFFF)
* Constant-time slicing using integer offsets
* Basic unit is a Code Point
* Contiguous in memory

The best idea I've had so far for making UTF-8 have constant-time
slicing is to use a two-level table, with the second level having one
byte per code point.  However, that brings up the minimum size to
(more than) 2 bytes per code point, ruining any space advantage that
utf-8 had.

UTF-16 is in the same boat, but it's (more than) 3 bytes per code point.

I think the only viable options (without changing the requirements)
are straight UCS-4 or three-way (Latin-1/UCS-2/UCS-4).  The size
variability of three-way doesn't seem so important when its only
competitor is straight UCS-4.

The deciding factor is what we want to expose to third-party interfaces.

Sane interface (not bytes/code units), good efficiency, C-accessible: pick two.

-- 
Adam Olsen, aka Rhamphoryncus

From rasky at develer.com  Thu Sep 21 01:00:32 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Thu, 21 Sep 2006 01:00:32 +0200
Subject: [Python-3000] Removing __del__
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
	<fb6fbf560609201204u2cb14ba1qb15854fd8444bc98@mail.gmail.com>
Message-ID: <001501c6dd08$93c73260$7b4b2597@bagio>

Jim Jewett <jimjjewett at gmail.com> wrote:

>>> I believe my example is a good use case for __del__ with no good
>>> enough workaround, ... I still like the __close__ method being
>>> proposed.
>
> [Michael asks about this alternative]
> ...
>> def on_del_invoke(obj, func, *args, **kwargs):
> ...
>>      Please note that the callable must not be a bound method
>>      of the object being watched, and the object being watched
>>      must not be (or be referred to by) one of the arguments
>>      or else the object will never be garbage collected."""
>
> By far the most frequently desired callable is self.close.
>
> You can work around this with a wrapper, by setting self.f=open(...)
> and then passing self.f.close -- but with this API, I'll be wondering
> why I can't just register self.f as the object in the first place.
>
> If bound methods did not increment the refcount, this would work, but
> I imagine it would break various GUI and event-processing idioms.
>
> A special rebind-this-method-weakly builtin would work, but I'm not
> sure that is any simpler than __close__.  (~= __del__ but cycles can
> be broken in an arbitrary order)

I once wrote a simple weakref wrapper, which binds methods weakly (it's pretty
easy to write). I thought it would prove dramatically useful one day, and I
have yet to use it even once :) And yes, I agree that __close__ is a much
easier solution to the problem.
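
Something along these lines (a sketch; __self__/__func__ are spelled
im_self/im_func in 2.x):

    import weakref

    class WeakMethod(object):
        """Call a bound method without keeping its instance alive."""
        def __init__(self, bound_method):
            self._ref = weakref.ref(bound_method.__self__)
            self._func = bound_method.__func__

        def __call__(self, *args, **kwargs):
            obj = self._ref()
            if obj is None:
                raise ReferenceError('instance has been collected')
            return self._func(obj, *args, **kwargs)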

> Note that the wrapper (as posted) does nothing except store a pointer
> to the CAPI object and then delegate to it.  With a __close__ method,
> this class could reduce to (at most)
>
>     class MyCAPI(CAPI):
>         __close__ = close

Ehm, can a class be derived from a module?

Giovanni Bajo


From rhamph at gmail.com  Thu Sep 21 01:02:49 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 17:02:49 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
Message-ID: <aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Michael Chermside <mcherm at mcherm.com> wrote:
> > I wrote:
> > >>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
> > >>> msg[35:-18]
> > u'"\U00010143"'
> > >>> greek_five = msg[36:-19]
> > >>> len(greek_five)
> > 2
> >
> >
> > After posting, I realized that it's worse than that. I suspect that if
> > I tried this on a CPython compiled with wide characters, then
> > len(greek_five) would be 1.
> >
> > What should it be? 2? 1? Implementation-dependent?
>
> This has all been rehashed endlessly. It's implementation (and
> platform- and compilation options-) dependent because there are good
> reasons for both choices. Even if CPython 3.0 supports a dynamic
> choice (which some are proposing) then the *language* will still make
> it implementation dependent because of Jython and IronPython, where
> the only choice is UTF-16 (or UCS-2, depending on the attitude towards
> surrogates).

Wow, you really did mean code units.  In that case I'm very tempted to
support UTF-8, with byte indexing (which is what code units are in its
case).  It's ugly, but it technically works fine, and it's the de
facto standard on Linux.  No more ugly than UTF-16 code units IMO,
just more obvious.

-- 
Adam Olsen, aka Rhamphoryncus

From guido at python.org  Thu Sep 21 01:08:01 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 16:08:01 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
Message-ID: <ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com>

On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> Wow, you really did mean code units.  In that case I'm very tempted to
> support UTF-8, with byte indexing (which is what code units are in its
> case).  It's ugly, but it technically works fine, and it's the de
> facto standard on Linux.  No more ugly than UTF-16 code units IMO,
> just more obvious.

Who charged you with designing the string implementation?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From qrczak at knm.org.pl  Thu Sep 21 01:13:56 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 01:13:56 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	(Guido van Rossum's message of "Wed, 20 Sep 2006 15:52:16 -0700")
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
Message-ID: <87mz8u9lmz.fsf@qrnik.zagroda>

"Guido van Rossum" <guido at python.org> writes:

> Even if CPython 3.0 supports a dynamic choice (which some are
> proposing) then the *language* will still make it implementation
> dependent because of Jython and IronPython, where the only choice
> is UTF-16 (or UCS-2, depending on the attitude towards surrogates).

Jython and IronPython could use a dual UCS-2 / UTF-32 encoding
(with some work and interoperability overhead I admit).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From rhamph at gmail.com  Thu Sep 21 01:20:29 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 17:20:29 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
	<ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com>
Message-ID: <aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > Wow, you really did mean code units.  In that case I'm very tempted to
> > support UTF-8, with byte indexing (which is what code units are in its
> > case).  It's ugly, but it technically works fine, and it's the de
> > facto standard on Linux.  No more ugly than UTF-16 code units IMO,
> > just more obvious.
>
> Who charged you with designing the string implementation?

Last I checked, the point of mailing lists such as these was to allow
input from the community at large.

In any case, my reaction was simply because I misunderstood your intentions.

-- 
Adam Olsen, aka Rhamphoryncus

From jcarlson at uci.edu  Thu Sep 21 02:29:32 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 17:29:32 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201552q605c20e0od3df500d4a13dff0@mail.gmail.com>
References: <20060920142332.0825.JCARLSON@uci.edu>
	<aac2c7cb0609201552q605c20e0od3df500d4a13dff0@mail.gmail.com>
Message-ID: <20060920165545.082B.JCARLSON@uci.edu>


"Adam Olsen" <rhamph at gmail.com> wrote:
> On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Adam Olsen" <rhamph at gmail.com> wrote:
> > >
> > > On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > > >
> > > > "Adam Olsen" <rhamph at gmail.com> wrote:
> 
> [snip token stuff]
> 
> Withdrawn.  Blake Winston pointed me to some problems in private as well.
> 
> 
> > If I can't slice based on character index, then we end up with a similar
> > situation that the wxPython StyledTextCtrl runs into right now: the
> > content is encoded via utf-8 internally, so users have to use the fairly
> > annoying PositionBefore(pos) and PositionAfter(pos) methods to discover
> > where characters start/end.  While it is possible to handle everything
> > this way, it is *damn annoying*, and some users have gone so far as to
> > say that it *doesn't work* for Europeans.
> >
> > While I won't make the claim that it *doesn't work*, it is a pain in the
> > ass.
> 
> I'm going to agree with you.  That's also why I'm going to assume
> Guido meant to use Code Points, not Code Units (which would be bytes
> in the case of UTF-8).




> > > Using only utf-8 would be simpler than three distinct representations.
> > >  And if memory usage is an issue (which it seems to be, albeit in a
> > > vague way), we could make a custom encoding that's even simpler and
> > > more space efficient than utf-8.
> >
> > One of the reasons I've been pushing for the 3 representations is
> > because it is (arguably) optimal for any particular string.
> 
> It bothers me that adding a single character would cause it to double
> or quadruple in size.  May be the best compromise though.

Ahh, but the crucial observation is that the string would have been
two or four times as large initially.


> I'm not entierly convinced, but I'll leave it for now.  Maybe it'll be
> a 3.1 feature.

I'll just say, "you ain't gonna need it".  Why?  In my experience, I
rarely, if ever, say "give me the ith word" or "give me the ith line".
What I really do is, "give me the first ..., and the remaining ...".  With
partition (with or without views), you can do these things quite easily.
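
For example:

    line = 'GET /index.html HTTP/1.0'
    first, _, rest = line.partition(' ')
    # first == 'GET', rest == '/index.html HTTP/1.0'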


> > The benefits gained by using the three internal representations are
> > primarily from a simplicity standpoint.  That is to say, when
> > manipulating any one of the three representations, you know that the
> > value at offset X represents the code point of character X in the string.
> >
> > Further, with a slight change in how the single-segment buffer interface
> > is defined (returning the width of the character), C extensions that want
> > to deal with unicode strings in *native* format (due to concerns about
> > speed) could do so without having to worry about reencoding,
> > variable-width characters, etc.
> 
> Is it really worthwhile if there's three different formats they'd have
> to handle?

It would depend, but any application that currently handles unicode
strings on both utf-16 and UCS-4 builds of Python would require only
slight modification to handle Latin-1, and could be simplified to handle
UCS-2 instead of UTF-16.


> Indeed, it seems like all our options are lose-lose.
> 
> Just to summarize, our requirements are:
> * Full unicode range (0 through 0x10FFFF)
> * Constant-time slicing using integer offsets
> * Basic unit is a Code Point
> * Continuous in memory
> 
> The best idea I've had so far for making UTF-8 have constant-time
> slicing is to use a two-level table, with the second level having one
> byte per code point.  However, that brings up the minimum size to
> (more than) 2 bytes per code point, ruining any space advantage that
> utf-8 had.

(I'm not advocating the following, just expressing that it could be done)

Another way of doing it would be to have the underlying string in
utf-8 (or even utf-16), but layer a specially crafted tree structure
over the top of it, sized in a particular manner so that (bytes used)/
(bytes required) is somewhat small.  It could offer log-time discovery
of offsets (for slicing) and a memory-contiguous representation.  This
tree could also be generated after some k slices, avoiding the overhead
of tree creation unless we have determined it to be reasonable to
amortize.

If one chooses one node per k*log(n) characters, then we get O(n) tree
construction time with the same big-O index discovery time, using
roughly 24*n/(k*log(n)) additional bytes per string of length n
(assuming a 64-bit Python).  Choose k=24 (or k=12 on a 32-bit Python),
and we get a used/required ratio of 1 + 1/log(n).  Not bad.
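To make the idea concrete, here is a minimal Python 2-style sketch of
the checkpointing scheme (my illustration, not a finished design: a
flat list of byte offsets sampled every step characters, so finding
character i costs one list lookup plus at most step-1 forward scans):

def _is_lead(byte):
    # true for bytes that start a UTF-8 encoded character
    return (ord(byte) & 0xC0) != 0x80

def build_index(buf, step):
    """Byte offsets of characters 0, step, 2*step, ... in UTF-8 buf."""
    index, chars = [], 0
    for pos in range(len(buf)):
        if _is_lead(buf[pos]):
            if chars % step == 0:
                index.append(pos)
            chars += 1
    return index

def char_to_byte(buf, index, step, i):
    """Byte offset of character i: O(1) lookup plus an O(step) scan."""
    pos = index[i // step]
    for _ in range(i % step):
        pos += 1                    # step past the current lead byte,
        while pos < len(buf) and not _is_lead(buf[pos]):
            pos += 1                # then skip its continuation bytes
    return pos

Sizing step proportionally to k*log(n) gives the space/time trade-off
described above.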


> I think the only viable options (without changing the requirements)
> are straight UCS-4 or three-way (Latin-1/UCS-2/UCS-4).  The size
> variability of three-way doesn't seem so important when its only
> competitor is straight UCS-4.
> 
> The deciding factor is what we want to expose to third-party interfaces.
> 
> Sane interface (not bytes/code units), good efficiency, C-accessible: pick two.

I would say that both options are C-accessible, though perhaps not
optimally in either case.  Note that we can always recode for the
third-party interfaces; that's what is already done for PyGTK, wxPython
on Linux, 32-bit characters on Windows, etc.

 - Josiah


From greg.ewing at canterbury.ac.nz  Thu Sep 21 02:54:29 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 21 Sep 2006 12:54:29 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87psdqztxk.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz>
	<87psdqztxk.fsf@qrnik.zagroda>
Message-ID: <4511E2C5.2020500@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> Incremental GC (e.g. in OCaml) has short pauses. It doesn't scan all
> memory at once, but distributes the work among GC cycles.

Can it be made to guarantee that no pause will
be longer than some small amount, such as 20ms?
Because that's what is needed to ensure smooth
animation.

--
Greg

From greg.ewing at canterbury.ac.nz  Thu Sep 21 03:14:23 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 21 Sep 2006 13:14:23 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
Message-ID: <4511E76F.9000706@canterbury.ac.nz>

Michael Chermside wrote:

> >    * Programmers no longer have the ability to allow __del__
> >      to resurrect the object being finalized.

I've never even considered trying to write such code,
and can't think of any reason why I ever would, so
I wouldn't miss this ability at all.

--
Greg

From david.nospam.hopwood at blueyonder.co.uk  Thu Sep 21 03:09:24 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Thu, 21 Sep 2006 02:09:24 +0100
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
	<bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>
Message-ID: <4511E644.2030306@blueyonder.co.uk>

Brett Cannon wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
>> On 9/20/06, Guido van Rossum <guido at python.org> wrote:
>> > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
>> > >
>> > > Before we can decide on the internal representation of our unicode
>> > > objects, we need to decide on their external interface.  My thoughts
>> > > so far:
>> >
>> > Let me cut this short. The external string API in Py3k should not
>> > change or only very marginally so (like removing rarely used useless
>> > APIs or adding a few new conveniences). The plan is to keep the 2.x
>> > API that is supported (in 2.x) by both str and unicode, but merge the
>> > two string types into one. Anything else could be done just as easily
>> > before or after Py3k.
>>
>> Thanks, but one thing remains unclear: is the indexing intended to
>> represent bytes, code points, or code units?  Note that C code
>> operating on UTF-16 would use code units for slicing of UTF-16, which
>> splits surrogate pairs.
> 
> Assuming my Unicode lingo is right and code point represents a
> letter/character/digraph/whatever, then it will be a code point.  Doing one
> of my rare channels of Guido, I *really* doubt he wants to expose the
> technical details of Unicode to the point of having people need to realize
> that UTF-8 takes two bytes to represent "ö".

The argument used here is not valid. People do need to realize that *all*
Unicode encodings are variable-length, in the sense that abstract characters
can be represented by multiple code points.

For example, "?" can be represented either as the precomposed character U+00F6,
or as "o" followed by a combining diaeresis (U+006F U+0308). Programs must
avoid splitting sequences of code points that represent a single abstract
character. A program that does that correctly will automatically also avoid
splitting within the representation of a code point, whatever UTF is used.
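A quick interactive illustration, using the stdlib unicodedata module:

    >>> import unicodedata
    >>> precomposed = u'\u00f6'            # "ö" as one code point
    >>> combining = u'o\u0308'             # "o" + combining diaeresis
    >>> precomposed == combining           # same abstract character...
    False
    >>> len(combining)                     # ...but two code points
    2
    >>> unicodedata.normalize('NFC', combining) == precomposed
    True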

> If you want that kind of
> exposure, use the bytes type.  Otherwise assume the usage will be by people
> ignorant of Unicode and thus want something that will work the way they are
> used to when compared to working in ASCII.

It simply is not possible to do correct string processing in Unicode that
will "work the way [programmers] are used to when compared to working in ASCII".

The Unicode standard is on-line at www.unicode.org, and is quite well written,
with lots of motivation and explanation of how processing international texts
necessarily differs from working with ASCII. There is no excuse for any
programmer doing text processing not to have read it.

Should we nevertheless try to avoid making the use of Unicode strings
unnecessarily difficult for people who have minimal knowledge of Unicode?
Absolutely, but not at the expense of making basic operations on strings
asymptotically less efficient. O(1) indexing and slicing is a basic
requirement, even if it has to be done using code units.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>




From greg.ewing at canterbury.ac.nz  Thu Sep 21 03:34:14 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Thu, 21 Sep 2006 13:34:14 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <fb6fbf560609201204u2cb14ba1qb15854fd8444bc98@mail.gmail.com>
References: <20060920092756.eng2b1ul40mtc8sg@login.werra.lunarpages.com>
	<fb6fbf560609201204u2cb14ba1qb15854fd8444bc98@mail.gmail.com>
Message-ID: <4511EC16.4030307@canterbury.ac.nz>

Jim Jewett wrote:

> How do you feel about the __del__ in stdlib subprocess.Popen (about line 615)?
> 
> This resurrects itself, in order to finish waiting for the child
> process.

I don't see a need for resurrection here. Why can't it
create another object holding the necessary info for
doing the waiting?
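For instance, a weakref-based sketch (names invented for illustration;
a real version would poll with WNOHANG the way subprocess does):

import os, weakref

_waiters = {}                      # keeps the weakrefs themselves alive

def watch(popen):
    """Reap popen's child after the Popen object itself is collected."""
    def _reap(ref, pid=popen.pid):
        del _waiters[ref]
        try:
            os.waitpid(pid, os.WNOHANG)  # holds only the pid, not the Popen
        except OSError:
            pass
    _waiters[weakref.ref(popen, _reap)] = popen.pid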

> (And note that if it needed to revive (not recreate, revive)
> subobjects, it would need the full immortal-cycle power of today's
> __del__.

Any subobjects which may need to be preserved can be
passed as arguments to the finalizer, which can then
prevent them from dying in the first place if it wants.

I'm far from convinced that there's ever a *need* for
resurrection.

--
Greg

From jcarlson at uci.edu  Thu Sep 21 03:58:30 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 20 Sep 2006 18:58:30 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4511E644.2030306@blueyonder.co.uk>
References: <bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>
	<4511E644.2030306@blueyonder.co.uk>
Message-ID: <20060920184147.082F.JCARLSON@uci.edu>


David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Brett Cannon wrote:
[snip]
> > If you want that kind of
> > exposure, use the bytes type.  Otherwise assume the usage will be by people
> > ignorant of Unicode and thus want something that will work the way they are
> > used to when compared to working in ASCII.
> 
> It simply is not possible to do correct string processing in Unicode that
> will "work the way [programmers] are used to when compared to working in ASCII".
> 
> The Unicode standard is on-line at www.unicode.org, and is quite well written,
> with lots of motivation and explanation of how processing international texts
> necessarily differs from working with ASCII. There is no excuse for any
> programmer doing text processing not to have read it.

Since basically everyone using Python today performs "text processing"
in one way or another, you are saying that basically everyone should be
reading the Unicode spec before using Python.  Never mind that the
document is larger than most people want to read, and that you didn't
provide a link to the most applicable section (with regard to *using*
unicode).  I will also mention that in the Unicode 4.0 spec, Chapter 5
"Implementation Guidelines" starts with:

'''
It is possible to implement a substantial subset of the Unicode Standard
as "wide ASCII" with little change to existing programming practice. ...
'''

It later goes on to explain where "wide ASCII" is not a reasonable
strategy, but I'm not sure that users of Python necessarily need to know
all of that.


> Should we nevertheless try to avoid making the use of Unicode strings
> unnecessarily difficult for people who have minimal knowledge of Unicode?
> Absolutely, but not at the expense of making basic operations on strings
> asymptotically less efficient. O(1) indexing and slicing is a basic
> requirement, even if it has to be done using code units.

I believe you mean "code points"; "code units" imply non-O(1) indexing
and slicing (variable-width characters).


 - Josiah


From guido at python.org  Thu Sep 21 03:55:24 2006
From: guido at python.org (Guido van Rossum)
Date: Wed, 20 Sep 2006 18:55:24 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
	<ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com>
	<aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>
Message-ID: <ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>

On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > > Wow, you really did mean code units.  In that case I'm very tempted to
> > > support UTF-8, with byte indexing (which is what code units are in its
> > > case).  It's ugly, but it technically works fine, and it's the de
> > > facto standard on Linux.  No more ugly than UTF-16 code units IMO,
> > > just more obvious.
> >
> > Who charged you with designing the string implementation?
>
> Last I checked, the point of mailing lists such as these was to allow
> input from the community at large.
>
> In any case, my reaction was simply because I misunderstood your intentions.

I was specifically reacting to your use of the phrasing "I'm very
tempted to support UTF-8"; this wording suggests that it would be your
choice to make. I could have pointed out the obvious (that equating
the difficulty of using UTF-8 with that of using UTF-16 doesn't make
it so) but I figured the other readers are also tired of your attempts
to move this into an entirely different direction, and based on a
thorough lack of understanding of the status quo no less.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

From rhamph at gmail.com  Thu Sep 21 04:12:35 2006
From: rhamph at gmail.com (Adam Olsen)
Date: Wed, 20 Sep 2006 20:12:35 -0600
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
	<ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com>
	<aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>
	<ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
Message-ID: <aac2c7cb0609201912j7c77f24dp88c7aabd614a056@mail.gmail.com>

On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > On 9/20/06, Guido van Rossum <guido at python.org> wrote:
> > > On 9/20/06, Adam Olsen <rhamph at gmail.com> wrote:
> > > > Wow, you really did mean code units.  In that case I'm very tempted to
> > > > support UTF-8, with byte indexing (which is what code units are in its
> > > > case).  It's ugly, but it technically works fine, and it's the de
> > > > facto standard on Linux.  No more ugly than UTF-16 code units IMO,
> > > > just more obvious.
> > >
> > > Who charged you with designing the string implementation?
> >
> > Last I checked, the point of mailing lists such as these was to allow
> > input from the community at large.
> >
> > In any case, my reaction was simply because I misunderstood your intentions.
>
> I was specifically reacting to your use of the phrasing "I'm very
> tempted to support UTF-8"; this wording suggests that it would be your
> choice to make. I could have pointed out the obvious (that equating
> the difficulty of using UTF-8 with that of using UTF-16 doesn't make
> it so) but I figured the other readers are also tired of your attempts
> to move this into an entirely different direction, and based on a
> thorough lack of understanding of the status quo no less.

It was poor wording then.  I never intended to imply that it was my
choice.  Instead, I was referring to the input I have as a member of
the community.

I am not attempting to move this in a different direction.  I (and
apparently several other people) thought it always was a different
direction.  It is obvious now that it wasn't your intent to use code
points, and I can accept that code units are the best (most efficient)
choice.

-- 
Adam Olsen, aka Rhamphoryncus

From fredrik at pythonware.com  Thu Sep 21 09:00:56 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 21 Sep 2006 09:00:56 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>
Message-ID: <eetdb7$dov$1@sea.gmane.org>

Adam Olsen wrote:

> Wow, you really did mean code units.  In that case I'm very tempted to
> support UTF-8, with byte indexing (which is what code units are in its
> case).  It's ugly, but it technically works fine, and it's the de
> facto standard on Linux.  No more ugly than UTF-16 code units IMO,
> just more obvious.

*plonk*

</F>


From fredrik at pythonware.com  Thu Sep 21 09:16:01 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 21 Sep 2006 09:16:01 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>	<aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com>	<ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com>	<aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>
	<ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
Message-ID: <eete7i$gr4$1@sea.gmane.org>

Guido van Rossum wrote:

> based on a thorough lack of understanding of the status quo no less.

that's, unfortunately, a bit too common on this list.

(as the author of Python's Unicode type and cElementTree, I especially 
like arguments along the lines of "using separate buffers to hold the 
actual character data is not feasible" and "that narrow storage would 
have any advantages over wide storage is far from proven".  nice try, 
guys ;-)

</F>


From fredrik at pythonware.com  Thu Sep 21 09:21:34 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 21 Sep 2006 09:21:34 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4511E644.2030306@blueyonder.co.uk>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>	<bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>
	<4511E644.2030306@blueyonder.co.uk>
Message-ID: <eetehu$gr4$2@sea.gmane.org>

David Hopwood wrote:

> For example, "?" can be represented either as the precomposed character U+00F6,
> or as "o" followed by a combining diaeresis (U+006F U+0308).

normalization is a good thing, though:

     http://www.w3.org/TR/charmod-norm/

(it would probably be a good idea to turn unicodedata.normalize into a 
method for the new unicode string type).
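a rough sketch of what that could look like; "text" here is just a
hypothetical stand-in for the new string type:

import unicodedata

class text(unicode):
    def normalize(self, form='NFC'):
        return type(self)(unicodedata.normalize(form, self))

with text(u'o\u0308').normalize() == text(u'\xf6') evaluating true.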

</F>


From qrczak at knm.org.pl  Thu Sep 21 09:51:02 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 09:51:02 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <4511E2C5.2020500@canterbury.ac.nz> (Greg Ewing's message of
	"Thu, 21 Sep 2006 12:54:29 +1200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz>
	<87psdqztxk.fsf@qrnik.zagroda> <4511E2C5.2020500@canterbury.ac.nz>
Message-ID: <87slil1wux.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

>> Incremental GC (e.g. in OCaml) has short pauses. It doesn't scan all
>> memory at once, but distributes the work among GC cycles.
>
> Can it be made to guarantee that no pause will
> be longer than some small amount, such as 20ms?

It's not hard realtime. There are no strict guarantees, and a single
large object is processed in whole.

Python also processes large objects in whole.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Thu Sep 21 11:22:29 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 11:22:29 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4511E644.2030306@blueyonder.co.uk> (David Hopwood's message of
	"Thu, 21 Sep 2006 02:09:24 +0100")
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>
	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
	<bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>
	<4511E644.2030306@blueyonder.co.uk>
Message-ID: <87venh60bu.fsf@qrnik.zagroda>

David Hopwood <david.nospam.hopwood at blueyonder.co.uk> writes:

> People do need to realize that *all* Unicode encodings are
> variable-length, in the sense that abstract characters can be
> represented by multiple code points.

Unicode algorithms for case mapping, word splitting, collation etc.
are generally defined in terms of code points.  The character database
is keyed by code points, which are the largest practical text unit with
a finite domain.

Even if on the high level there are some other units, any algorithm
which determines these high level text boundaries is easier to
implement in terms of code points than in terms of even lower-level
UTF-x code units.
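For instance, the character database is directly available from the
stdlib, so a crude "don't split before a combining mark" boundary check
is only a few lines:

    >>> import unicodedata
    >>> s = u'o\u0308n'                    # "ö" (combining form) + "n"
    >>> [i for i in range(len(s)) if not unicodedata.combining(s[i])]
    [0, 2]

(positions 0 and 2 are the only places this string may be split).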

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From ncoghlan at iinet.net.au  Thu Sep 21 12:26:18 2006
From: ncoghlan at iinet.net.au (Nick Coghlan)
Date: Thu, 21 Sep 2006 20:26:18 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <fb6fbf560609200543l70f16862p750da26b18eb66da@mail.gmail.com>
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>	
	<451123A2.7040701@gmail.com>
	<fb6fbf560609200543l70f16862p750da26b18eb66da@mail.gmail.com>
Message-ID: <451268CA.3030307@iinet.net.au>

Second attempt, this time to the right list :)

Jim Jewett wrote:
> On 9/20/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
>>              # Create a class with the same instance attributes
>>              # as the original
>>              class attr_holder(object):
>>                  pass
>>              finalizer_arg = attr_holder()
>>              finalizer_arg.__dict__ = self.__dict__
> 
> Does this really work?

It works for normal user-defined classes at least:

>>> class C1(object):
...     pass
...
>>> class C2(object):
...     pass
...
>>> a = C1()
>>> b = C2()
>>> b.__dict__ = a.__dict__
>>> a.x = 1
>>> b.x
1

> (1)  for classes with a dictproxy of some sort, you might get either a
> copy (which isn't updated)

Classes that change the way __dict__ is handled would probably need to define
their own __del_arg__.

> (2)  for other classes, self might be added to the dict later

Yeah, that's the strongest argument I know of against having that default
fallback - it can easily lead to a strong reference from sys.finalizers into
an otherwise unreachable cycle. I believe it currently takes two __del__
methods to prevent a cycle from being collected, whereas in this setup it
would only take one.

OTOH, fixing it would be much easier than it is now (by setting __del_arg__
to something that holds only the subset of attributes that require finalization).

> and of course, if it isn't added later, then it doesn't have the full
> power of current finalizers -- just the __close__ subset.

True, but most finalizers I've seen don't really *need* the full power of the
current __del__. They only need to get at a couple of their internal members
in order to explicitly release external resources.

And more sophisticated usage is still possible by assigning an appropriate
value to __del_arg__.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From rasky at develer.com  Thu Sep 21 12:28:51 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Thu, 21 Sep 2006 12:28:51 +0200
Subject: [Python-3000] How will unicode get used?
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com><ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com><aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com><ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com><aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>
	<ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
Message-ID: <015c01c6dd68$bb6bcfa0$e303030a@trilan>

Guido van Rossum wrote:

> I was specifically reacting to your use of the phrasing "I'm very
> tempted to support UTF-8"; this wording suggests that it would be your
> choice to make. I could have pointed out the obvious (that equating
> the difficulty of using UTF-8 with that of using UTF-16 doesn't make
> it so) but I figured the other readers are also tired of your attempts
> to move this into an entirely different direction, and based on a
> thorough lack of understanding of the status quo no less.

Is there a design document explaining the rationale of the unicode
type, the status quo?  Any time this subject is raised on the mailing
list, the net result is "you guys don't understand unicode".  Well, let
us know what is good and what is bad about the current unicode type;
what is by design and what is an implementation detail; what you want to
absolutely keep, and what you want to absolutely change.  I am *really*
confused about the status quo of the unicode type (which is why I keep
myself out of technical discussions on the matter, of course).  Is there
any desire to let people understand and join the discussion?

Or otherwise, let's decide that the unicode type in Py3k will not be
publicly discussed and will be handled only by the experts. This would
save us from these "attempts" as well.
-- 
Giovanni Bajo


From ncoghlan at gmail.com  Thu Sep 21 12:31:25 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Thu, 21 Sep 2006 20:31:25 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
References: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
Message-ID: <451269FD.30508@gmail.com>

Michael Chermside wrote:
> Nick Coghlan writes:
>    [...proposes revision of __del__ rather than removal...]
>> The only way for __del__ to receive a reference to self is if the 
>> finalizer argument had a reference to it - but that would mean the 
>> object itself was not
>> collectable, so __del__ wouldn't be called in the first place.
>>
>> That all seems too simple, though. Since we're talking about gc and 
>> that's never simple, there has to be something wrong with the idea :)
> 
> Unfortunately you're right... this is all too simple. The existing
> mechanism doesn't have a problem with __del__ methods that do not
> participate in loops. For those that DO participate in loops I
> think it's perfectly plausible for your __del__ to receive a reference
> to the actual object being finalized.

Nope. If the argument to __del__ has a strong reference to the object, that 
object simply won't get finalized at all because it's not in an unreachable 
cycle. sys.finalizers would act as a global root for all objects reachable 
from finalizers (with those refcounts only being decremented when the callback 
removes the weakref object from the finalizer set).

> Another problem (but less important as it's trivially fixable) is that
> you're storing away the values that the object had when it was created,
> perhaps missing out on things that got added or initialized later.

The default fallback doesn't do that - it stores a reference to the instance 
dictionary of the object so it sees later modifications and additions.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From fredrik at pythonware.com  Thu Sep 21 12:41:08 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 21 Sep 2006 12:41:08 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <015c01c6dd68$bb6bcfa0$e303030a@trilan>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com><ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com><aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com><ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com><aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>	<ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
	<015c01c6dd68$bb6bcfa0$e303030a@trilan>
Message-ID: <eetq84$orl$1@sea.gmane.org>

Giovanni Bajo wrote:

> Is there a design document explaining the rationale of unicode type, the
> status quo? 

Guido isn't complaining about people who don't understand the rationale 
behind the design, he's complaining about people who HAVEN'T EVEN LOOKED 
AT THE CURRENT DESIGN before spouting off random proposals.

</F>


From gabor at nekomancer.net  Thu Sep 21 12:50:30 2006
From: gabor at nekomancer.net (Gábor Farkas)
Date: Thu, 21 Sep 2006 12:50:30 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>
	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
Message-ID: <45126E76.9020600@nekomancer.net>

Guido van Rossum wrote:
> On 9/20/06, Michael Chermside <mcherm at mcherm.com> wrote:
>> I wrote:
>>>>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>>>>> msg[35:-18]
>> u'"\U00010143"'
>>>>> greek_five = msg[36:-19]
>>>>> len(greek_five)
>> 2
>>
>>
>> After posting, I realized that it's worse than that. I suspect that if
>> I tried this on a CPython compiled with wide characters, then
>> len(greek_five) would be 1.
>>
>> What should it be? 2? 1? Implementation-dependent?
> 
> This has all been rehashed endlessly. It's implementation (and
> platform- and compilation options-) dependent because there are good
> reasons for both choices. 

while i understand the constraints, i think it's not a good decision to
leave this implementation-dependent.

strings seem to me such a basic piece of functionality that their
behaviour should not depend on the platform.

for example, how is an application developer then supposed to write
his applications?

should he write his own slicing/whatever functions to get consistent
behaviour on linux/windows?

i think this is not just a 'theoretical' issue, it's a very practical
one.  the only reason it does not seem important is that currently
not many of the non-16-bit unicode characters are used.

(and this situation seems quite similar to the one when only
8-bit characters were used :-)

btw. an idea:

==============
maybe this 'problem' should be separated into 2 issues:

1. representation of the unicode string (utf-16 or utf-32)
2. behaviour of the unicode strings in python-3000

of course there are some dependencies between them. (mostly the 
performance of #2)

so why don't we make the *behaviour* cross-platform, and the 
*performance characteristics* and the *representation* platform-dependent?

(meaning that jython/ironpython could use utf-16, but would slice
strings more slowly (because of the surrogate issues))
================

> Even if CPython 3.0 supports a dynamic
> choice (which some are proposing) then the *language* will still make
> it implementation dependent because of Jython and IronPython, where
> the only choice is UTF-16 (or UCS-2, depending the attitude towards
> surrogates).
> 

i don't see why utf-16 should be the only choice. it's the
obvious/most convenient choice for jython/ironpython, that's correct.
but (correct me if i'm wrong) ironpython or jython could support utf-32
characters. it would of course mean that they could not use the
platform's string type for their string handling.

but in the same way i could say that, because most of the unix world is
utf-8, for those pythons the best way is to handle it internally as
utf-8, couldn't i?

it simply seems strange to me to make compromises that make the life of
cpython users harder, just to make life easier for the
jython/ironpython developers (i mean the 'creators').

gabor

From rasky at develer.com  Thu Sep 21 14:00:05 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Thu, 21 Sep 2006 14:00:05 +0200
Subject: [Python-3000] Removing __del__
References: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
	<451269FD.30508@gmail.com>
Message-ID: <042501c6dd75$79f2df70$e303030a@trilan>

Nick Coghlan wrote:

>> Unfortunately you're right... this is all too simple. The existing
>> mechanism doesn't have a problem with __del__ methods that do not
>> participate in loops. For those that DO participate in loops I
>> think it's perfectly plausible for your __del__ to receive a
>> reference to the actual object being finalized.
>
> Nope. If the argument to __del__ has a strong reference to the
> object, that object simply won't get finalized at all because it's
> not in an unreachable cycle.

What if the "self" passed to __del__ was instead a weakref.proxy, or a
similar wrapper object which does not give you access to the object itself
but lets you access its attributes? The object could have been already
collected for what I care, what I really need is to be able to say
"self.foo" to access what used to be a "foo" member of self. You can create
a totally different object of any type but with the same __dict__. Ok, it's
not that easy (properties, etc.), but you get the idea.

Am I missing something?
-- 
Giovanni Bajo


From qrczak at knm.org.pl  Thu Sep 21 14:54:26 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Thu, 21 Sep 2006 14:54:26 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <042501c6dd75$79f2df70$e303030a@trilan> (Giovanni Bajo's
	message of "Thu, 21 Sep 2006 14:00:05 +0200")
References: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>
	<451269FD.30508@gmail.com> <042501c6dd75$79f2df70$e303030a@trilan>
Message-ID: <87u031xtvh.fsf@qrnik.zagroda>

"Giovanni Bajo" <rasky at develer.com> writes:

> What if the "self" passed to __del__ was instead a weakref.proxy,
> or a similar wrapper object which does not give you access to the
> object itself but lets you access its attributes?

weakref.proxy will find the object already dead.

I doubt this can be done fully automatically.

The basic design is splitting the object into an outer part handed to
clients, which is watched to become unreachable, and a private inner
part used to physically access the resource, including releasing it.
I see no good way around it.

Often the inner part is a single field which is already separated.
In other cases it might require an extra indirection, in particular
if it's a mutable field.

This design distinguishes between related objects which are needed
during finalization (fields of the inner object) and related objects
which are not (fields of the outer object).

Cycles involving only outer objects are harmless, they can be safely
freed together, triggering finalization of all associated objects.
Inner objects may also refer to most other objects, ensuring that
they are not finalized earlier. But a path from an inner object to its
associated outer object prevents it from being finalized and is a bug
in the program (unless it is broken before the object loses all other
references).
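
A minimal weakref sketch of that split (my names and illustration, not
a concrete proposal):

import weakref

_live = set()                      # roots the weakrefs themselves

class _Inner(object):
    """Inner part: owns the raw handle and knows how to release it."""
    def __init__(self, handle):
        self.handle = handle
    def release(self):
        print 'releasing %r' % self.handle   # stand-in for real cleanup

class Resource(object):
    """Outer part handed to clients; watched for unreachability."""
    def __init__(self, handle):
        self._inner = _Inner(handle)
        def _finalize(ref, inner=self._inner):
            _live.discard(ref)
            inner.release()        # no path back to the outer object
        _live.add(weakref.ref(self, _finalize))

Cycles among Resource objects stay collectable, because the finalizer
closes over only the _Inner.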

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From theller at python.net  Thu Sep 21 15:24:52 2006
From: theller at python.net (Thomas Heller)
Date: Thu, 21 Sep 2006 15:24:52 +0200
Subject: [Python-3000] Small Py3k task: fix modulefinder.py
In-Reply-To: <ca471dc20608291442p3d92790ema7aa35f85d38156a@mail.gmail.com>
References: <ca471dc20608291442p3d92790ema7aa35f85d38156a@mail.gmail.com>
Message-ID: <eeu3r3$rd8$2@sea.gmane.org>

Guido van Rossum schrieb:
> Is anyone familiar enough with modulefinder.py to fix its breakage in
> Py3k? It chokes in a nasty way (exceeding the recursion limit) on the
> relative import syntax. I suspect this is also a problem for 2.5, when
> people use that syntax; hence the cross-post. There's no unittest for
> modulefinder.py, but I believe py2exe depends on it (and of course
> freeze.py, but who uses that still?)
> 

I'm not (yet) using relative imports in 2.5 or Py3k, but have not been able
to reproduce the recursion limit problem.  Can you describe the package
that fails?

Thanks,
Thomas


From david.nospam.hopwood at blueyonder.co.uk  Thu Sep 21 21:41:54 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Thu, 21 Sep 2006 20:41:54 +0100
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <eetehu$gr4$2@sea.gmane.org>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>	<bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>	<4511E644.2030306@blueyonder.co.uk>
	<eetehu$gr4$2@sea.gmane.org>
Message-ID: <4512EB02.6@blueyonder.co.uk>

Fredrik Lundh wrote:
> David Hopwood wrote:
> 
>>For example, "?" can be represented either as the precomposed character U+00F6,
>>or as "o" followed by a combining diaeresis (U+006F U+0308).
> 
> normalization is a good thing, though:
> 
>      http://www.w3.org/TR/charmod-norm/
> 
> (it would probably be a good idea to turn unicodedata.normalize into a 
> method for the new unicode string type).

Normalization is certainly a good thing to support. But that's orthogonal to
my point above -- that some abstract characters are representable by sequences
of more than one code point, which must not be split, and that avoidance of such
splitting automatically also avoids splitting within a code point representation.

Note that some abstract characters needed for living languages are representable
*only* by combining sequences.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>




From greg.ewing at canterbury.ac.nz  Fri Sep 22 01:57:08 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Fri, 22 Sep 2006 11:57:08 +1200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87slil1wux.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz>
	<87psdqztxk.fsf@qrnik.zagroda> <4511E2C5.2020500@canterbury.ac.nz>
	<87slil1wux.fsf@qrnik.zagroda>
Message-ID: <451326D4.5030401@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> It's not hard realtime. There are no strict guarantees, and a single
> large object is processed in whole.

I know. What I mean to say, I think, is can it be
designed so that there cannot be any pauses longer
than there would have been if freeing had been
performed as early as possible by refcounting.

--
Greg

From michel at dialnetwork.com  Fri Sep 22 04:22:55 2006
From: michel at dialnetwork.com (Michel Pelletier)
Date: Thu, 21 Sep 2006 19:22:55 -0700
Subject: [Python-3000] Kill GIL? - to PEP 3099?
In-Reply-To: <mailman.888.1158774462.10490.python-3000@python.org>
References: <mailman.888.1158774462.10490.python-3000@python.org>
Message-ID: <1158891775.14240.7.camel@amdy>


> Fredrik Lundh wrote:
> 
>  > no need to wait for Guido for this: adding library support for shared-
>  > memory dictionaries/lists is a no-brainer.  if you have experience in
>  > this field, start hacking.  I'll take care of the rest ;-)
> 
> and you don't need to wait for Python 3000 either, of course -- if done 
> right, this would certainly fit into some future 2.X release.

Here's a straight wrapper around the OSSP mm shared memory library:

http://70.103.91.130/~michel/pymm-0.1.tgz

I've only minimally tested it on AMD64 linux.  It exposes mm shared
memory regions as buffers and only wraps the "Standard" mm API.  Hacking
the Python memory system to put objects in shared memory is too deep for
me.  Included is a test that uses cPickle to share object state between
forked processes.  It needs a lot more testing and tweaking but it works
as a proof of concept.
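
For anyone who just wants the flavor of it, here is a stdlib-only toy
(an anonymous shared mmap plus pickle across a fork; no OSSP mm
involved, and unix-only):

import mmap, os, pickle

buf = mmap.mmap(-1, 4096)          # anonymous map, shared across fork
pid = os.fork()
if pid == 0:                       # child: pickle some state into the region
    data = pickle.dumps({'answer': 42})
    buf[:len(data)] = data
    os._exit(0)
os.waitpid(pid, 0)
print pickle.loads(buf[:])         # parent sees {'answer': 42}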

-Michel


From ncoghlan at gmail.com  Fri Sep 22 13:02:38 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 22 Sep 2006 21:02:38 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <87u031xtvh.fsf@qrnik.zagroda>
References: <20060920062401.70tb7a65nxus0skg@login.werra.lunarpages.com>	<451269FD.30508@gmail.com>
	<042501c6dd75$79f2df70$e303030a@trilan>
	<87u031xtvh.fsf@qrnik.zagroda>
Message-ID: <4513C2CE.4050506@gmail.com>

Marcin 'Qrczak' Kowalczyk wrote:
> "Giovanni Bajo" <rasky at develer.com> writes:
> 
>> What if the "self" passed to __del__ was instead a weakref.proxy,
>> or a similar wrapper object which does not give you access to the
>> object itself but lets you access its attributes?
> 
> weakref.proxy will find the object already dead.
> 
> I doubt this can be done fully automatically.
> 
> The basic design is splitting the object into an outer part handed to
> clients, which is watched to become unreachable, and a private inner
> part used to physically access the resource, including releasing it.
> I see no good way around it.
> 
> Often the inner part is a single field which is already separated.
> In other cases it might require an extra indirection, in particular
> if it's a mutable field.
> 
> This design distinguishes between related objects which are needed
> during finalization (fields of the inner object) and related objects
> which are not (fields of the outer object).
> 
> Cycles involving only outer objects are harmless, they can be safely
> freed together, triggering finalization of all associated objects.
> Inner objects may also refer to most other objects, ensuring that
> they are not finalized earlier. But a path from an inner object to its
> associated outer object prevents it from being finalized and is a bug
> in the program (unless it is broken before the object loses all other
> references).

Exactly. My strawman design made the default inner object a simple class with 
the same instance dictionary as the outer object so that most current __del__ 
implementations would 'just work', but it poses the problem of making it easy 
to inadvertently create an immortal cycle.

OTOH, that can already happen today, and the __del_arg__ mechanism provides an 
easy way of ensuring it doesn't happen.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From qrczak at knm.org.pl  Fri Sep 22 13:54:19 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Fri, 22 Sep 2006 13:54:19 +0200
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <451326D4.5030401@canterbury.ac.nz> (Greg Ewing's message of
	"Fri, 22 Sep 2006 11:57:08 +1200")
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz>
	<87psdqztxk.fsf@qrnik.zagroda> <4511E2C5.2020500@canterbury.ac.nz>
	<87slil1wux.fsf@qrnik.zagroda> <451326D4.5030401@canterbury.ac.nz>
Message-ID: <87lkocxgk4.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

> I know. What I mean to say, I think, is can it be designed so that
> there cannot be any pauses longer than there would have been if
> freeing had been performed as early as possible by refcounting.

The question is misleading: refcounting also causes pauses, but at
different times and with different length distribution. An incremental
GC generally has pauses which are incomparable to pauses of refcounting,
i.e. it has longer pauses where refcounting had shorter pauses and
vice versa.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From barry at python.org  Fri Sep 22 15:01:59 2006
From: barry at python.org (Barry Warsaw)
Date: Fri, 22 Sep 2006 09:01:59 -0400
Subject: [Python-3000] Delayed reference counting idea
In-Reply-To: <87lkocxgk4.fsf@qrnik.zagroda>
References: <aac2c7cb0609180848q420bfc6bvccda6f42bc76eedf@mail.gmail.com>
	<8764fl87j3.fsf@qrnik.zagroda>
	<422D230A-22E0-45D4-A5DC-267081CB8FEA@python.org>
	<87eju79aut.fsf@qrnik.zagroda>
	<CDB82BF6-5DB9-4BB7-AFE6-A561A781A311@python.org>
	<878xkfc1tj.fsf@qrnik.zagroda> <451089D7.6060204@canterbury.ac.nz>
	<87hcz38367.fsf@qrnik.zagroda> <45110180.4070807@canterbury.ac.nz>
	<87psdqztxk.fsf@qrnik.zagroda> <4511E2C5.2020500@canterbury.ac.nz>
	<87slil1wux.fsf@qrnik.zagroda> <451326D4.5030401@canterbury.ac.nz>
	<87lkocxgk4.fsf@qrnik.zagroda>
Message-ID: <BE9ECBDA-BA82-406A-AC61-3183C93DCF16@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sep 22, 2006, at 7:54 AM, Marcin 'Qrczak' Kowalczyk wrote:

> Greg Ewing <greg.ewing at canterbury.ac.nz> writes:
>
>> I know. What I mean to say, I think, is can it be designed so that
>> there cannot be any pauses longer than there would have been if
>> freeing had been performed as early as possible by refcounting.
>
> The question is misleading: refcounting also causes pauses, but at
> different times and with different length distribution. An incremental
> GC generally has pauses which are incomparable to pauses of  
> refcounting,
> i.e. it has longer pauses where refcounting had shorter pauses and
> vice versa.

Python's cyclic gc can also cause long pauses if you end up with a  
ton of objects in say generation 2, because it takes time just to  
traverse them even if they can't yet be collected.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)

iQCVAwUBRRPe2nEjvBPtnXfVAQLspwQAtFhE1UJprBg8Cf/jH6obpaP+4+T2GHsP
ZW2IUcp41jZanfVcrOOEVERyR5saQtsCRoZtaN8XTxOJ1P1ZBCXZId0kGc39MQBW
9J4RDoQ4WTdXQFIaN+15OHIkKDIaLFakX0/smdwjHfAm8QI8D8EbEoetbsu5q0nq
MbfLHc7kk2U=
=Dz8F
-----END PGP SIGNATURE-----

From mchermside at ingdirect.com  Fri Sep 22 15:04:49 2006
From: mchermside at ingdirect.com (Chermside, Michael)
Date: Fri, 22 Sep 2006 09:04:49 -0400
Subject: [Python-3000] Removing __del__
Message-ID: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>

I don't seem to have gotten anyone on board with the bold proposal
to just rip __del__ out and tell people to learn to use weakrefs.
But I'm hearing general agreement (at least among those contributing
to this thread) that it might be wise to change the status quo.

The two kinds of solutions I'm hearing are (1) those that are based
around making a helper object that gets stored as an attribute in
the object, or a list of weakrefs, or something like that, and (2)
the __close__ proposal (or perhaps keeping the name __del__ but
changing the semantics).

The difficulties with (1) that have been acknowledged so far are 
that the way you code things becomes somewhat less obvious, and
that there is the possibility of accidentally creating immortal
objects through reference loops.

I would like to hear someone address the weaknesses of (2). The
first I know of is that the code in your __close__ method (or 
__del__) must assume that it might have been in a reference loop
which was broken in some arbitrary place. As a result, it cannot
assume that all references it holds are still valid. To avoid
crashing the system, we'd probably have to set the broken
references to None (is that the best choice?), but can people
really write code that has to run assuming that its references
might be invalid?

A second problem I know of is, what if the code stores a reference
to self someplace? The ability for __del__ methods to resurrect
the object being finalized is one of the major sources of
complexity in the GC module, and changing the semantics to
__close__ doesn't fix this.

Does anyone defending __close__ want to address these issues?



-------- examples only below this line --------


Just in case it isn't clear enough, I wanted to put together
some examples. First, I'll do the kind of problem that __close__
handles well:

class MyClass(object):
    def __init__(self, resource1_name, resource2_name):
        self.resource1 = acquire_resource(resource1_name)
        self.resource2 = acquire_resource(resource2_name)
    def close(self):
        self.resource1.release()
        self.resource2.release()
    def __close__(self):
        self.close()

This is the simplest example I could think of for an object
which needs to call self.close() when it is freed in order to
release resources.

Now let's imagine creating a loop with such an object.

x = MyClass('db1', 'db2')
y = MyClass('db3', 'db4')
x.next = y
y.next = x

In today's world, with __del__ instead of __close__ such a
loop would be immortal (and the resources would never be
released). And it would work fine with __close__ semantics
because the __close__ method doesn't use self.next. So this
one is just fine.

The danger in __close__ is when something used (if only
indirectly) by the __close__ method participates in the loop.
We will modify the original example by adding a flush()
method which flushes the resources and calling it in close():

class MyClass2(object):
    def __init__(self, resource1_name, resource2_name):
        self.resource1 = acquire_resource(resource1_name)
        self.resource2 = acquire_resource(resource2_name)
    def flush(self):
        self.resource1.flush()
        self.resource2.flush()
        if hasattr(self, 'next'):
            self.next.flush()
    def close(self):
        self.resource1.release()
        self.resource2.release()
    def __close__(self):
        self.flush()
        self.close()

x = MyClass2('db1', 'db2')
y = MyClass2('db3', 'db4')
x.next = y
y.next = x

This version will encounter a problem. When the GC sees
the x <--> y loop it will break it somewhere... without
loss of generality, let us say it breaks the y -> x link
by setting y.next to None. Now y will be freed, so
__close__ will be called. __close__ will invoke self.flush()
which will then try to invoke self.next.flush(). But
self.next is None, so we'll get an exception and never
make it to invoking self.close().

------

The other problem I discussed is illustrated by the following
malicious code:

evil_list = []

class MyEvilClass(object):
    def __close__(self):
        evil_list.append(self)



Do the proponents of __close__ propose a way of prohibiting
this behavior? Or do we continue to include complicated
logic in the GC module to support it? I don't think anyone
cares how this code behaves so long as it doesn't segfault.

-- Michael Chermside







From jimjjewett at gmail.com  Fri Sep 22 15:53:27 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 22 Sep 2006 09:53:27 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
Message-ID: <fb6fbf560609220653o372e2610he97013dda3e4b0f2@mail.gmail.com>

On 9/22/06, Chermside, Michael <mchermside at ingdirect.com> wrote:
> the code in your __close__ method (or
> __del__) must assume that it might have been in a reference loop
> which was broken in some arbitrary place. As a result, it cannot
> assume that all references it holds are still valid.

Most close methods already assume this; how well they defend against it varies.

> A second problem I know of is, what if the code stores a reference
> to self someplace? The ability for __del__ methods to resurrect
> the object being finalized is one of the major sources of
> complexity in the GC module, and changing the semantics to
> __close__ doesn't fix this.

Even if this were forbidden, __close__ could still create a new object
that revived some otherwise-dead subobjects.  Needing those exact
subobjects (as opposed to a newly created equivalent) is the only
justification I've seen for keeping the original __del__ semantics.
(And even then, I think we should have __close__ as well, for the
normal case.)


> We will modify the original example by adding a flush()
> method which flushes the resources and calling it in close():

The more careful close methods already either check a flag attribute
or use try-except.

> class MyClass2(object):
>     def __init__(self, resource1_name, resource2_name):
>         self.resource1 = acquire_resource(resource1_name)
>         self.resource2 = acquire_resource(resource2_name)
>     def flush(self):
>         self.resource1.flush()
>         self.resource2.flush()
>         if hasattr(self, 'next'):
>             self.next.flush()

Do the two resources need to be as correct as possible, or as in-sync
as possible?

If they need to be as correct as possible, this would be

    def flush(self):
        try:
            self.resource1.flush()
        except Exception:
            pass
        try:
            self.resource2.flush()
        except Exception:
            pass
        try:
            # no need to check for self.next -- just eat the exception
            self.next.flush()
        except Exception:
            pass

Note that this is an additional motivation for exception expressions.
 (Or, at least, some way to write "This may fail -- I don't care" in
less than four lines.)

>     def close(self):
>         self.resource1.release()
>         self.resource2.release()
>     def __close__(self):
>         self.flush()
>         self.close()

If the resources instead need to be as in-sync as possible, then keep
the original flush, but replace __close__ with

    def __close__(self):
        try:
            self.flush()
        except Exception:
            pass
        self.close()     # exceptions here will be swallowed anyhow

> The other problem I discussed is illustrated by the following
> malicious code:

> evil_list = []

> class MyEvilClass(object):
>     def __close__(self):
>         evil_list.append(self)

> Do the proponents of __close__ propose a way of prohibiting
> this behavior? Or do we continue to include complicated
> logic the GC module to support it? I don't think anyone
> cares how this code behaves so long as it doesn't segfault.

I'll again point to the standard library module subprocess, where

    MyEvilClass ~= subprocess.Popen
    MyEvilClass.__close__  ~= subprocess.Popen.__del__
    evil_list ~= subprocess._active

It does the append only conditionally -- if it is still waiting for
the subprocess *and* python as a whole is not shutting down.

People do care how that code behaves.  If the decision is not to
support it (or to require that it be written in a more complicated
way), that may be a reasonable tradeoff, but there would be a cost.

-jJ

From rhettinger at ewtllc.com  Fri Sep 22 18:26:17 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Fri, 22 Sep 2006 09:26:17 -0700
Subject: [Python-3000] Removing __var
Message-ID: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A0@hammer.office.bhtrader.com>

I propose dropping the __var private-name mangling trick for
double underscores.

It is rarely used; it smells like a hack; it complicates introspection
tools; it's not beautiful; and it is not in line with Python's spirit of
"we're all consenting adults".


Raymond

From rhettinger at ewtllc.com  Fri Sep 22 18:26:23 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Fri, 22 Sep 2006 09:26:23 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
Message-ID: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>

[Michael Chermside]
> I don't seem to have gotten anyone on board with the bold 
> proposal to just rip __del__ out and tell people to learn 
> to use weakrefs.  But I'm hearing general agreement (at least
> among those contributing to this thread) that it might be 
> wise to change the status quo.

I'm on-board for just ripping out __del__.

Is there anything vital that could be done with a __close__ method that
can't already be done with a weakref callback?  We aren't going to need
it.

FWIW, don't despair on your original bold proposal.  While it's fun to
free associate and generate ideas for new atrocities, I think most of
your respondents are just kicking ideas around.

In the spirit of Py3k development, I recommend being quick to remove and
slow to add.  Let 3.0 emerge without __del__ and if strong use cases
emerge, there can be a 3.1 PEP for a new magic method.  I think Py3k
should be as lean as possible and then build-up very slowly afterwards,
emphasizing cruft-removal instead of cruft-substitution.


Raymond

From krstic at solarsail.hcs.harvard.edu  Fri Sep 22 18:28:56 2006
From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=)
Date: Sat, 23 Sep 2006 00:28:56 +0800
Subject: [Python-3000] Removing __var
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A0@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A0@hammer.office.bhtrader.com>
Message-ID: <45140F48.7000305@solarsail.hcs.harvard.edu>

Raymond Hettinger wrote:
> I propose dropping the __var private name mangling trick for
> double-underscores.

+1.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From krstic at solarsail.hcs.harvard.edu  Fri Sep 22 18:29:58 2006
From: krstic at solarsail.hcs.harvard.edu (=?UTF-8?B?SXZhbiBLcnN0acSH?=)
Date: Sat, 23 Sep 2006 00:29:58 +0800
Subject: [Python-3000] Removing __del__
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
Message-ID: <45140F86.6050004@solarsail.hcs.harvard.edu>

Raymond Hettinger wrote:
> I'm on-board for just ripping out __del__. [...]
> In the spirit of Py3k development, I recommend being quick to remove and
> slow to add.  Let 3.0 emerge without __del__ and if strong use cases
> emerge, there can be a 3.1 PEP for a new magic method.  I think Py3k
> should be as lean as possible and then build-up very slowly afterwards,
> emphasizing cruft-removal instead of cruft-substitution.

+1, on all counts.

-- 
Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | GPG: 0x147C722D

From ncoghlan at gmail.com  Fri Sep 22 18:49:55 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 23 Sep 2006 02:49:55 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
References: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
Message-ID: <45141433.9070302@gmail.com>

Raymond Hettinger wrote:
> FWIW, don't despair on your original bold proposal.  While it's fun to
> free associate and generate ideas for new atrocities, I think most of
> your respondents are just kicking ideas around.

Who, us? ;)

> In the spirit of Py3k development, I recommend being quick to remove and
> slow to add.  Let 3.0 emerge without __del__ and if strong use cases
> emerge, there can be a 3.1 PEP for a new magic method.  I think Py3k
> should be as lean as possible and then build-up very slowly afterwards,
> emphasizing cruft-removal instead of cruft-substitution.

I'd be fine with this too (my suggestion for updated __del__ semantics was 
pure syntactic sugar for a weakref based solution), but I don't think I use 
__del__ enough for my vote on this particular topic to mean anything :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From cfbolz at gmx.de  Fri Sep 22 19:00:36 2006
From: cfbolz at gmx.de (Carl Friedrich Bolz)
Date: Fri, 22 Sep 2006 19:00:36 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
Message-ID: <451416B4.1060201@gmx.de>

Chermside, Michael wrote:
[snip]
> The other problem I discussed is illustrated by the following
> malicious code:
>
> evil_list = []
>
> class MyEvilClass(object):
>     def __close__(self):
>         evil_list.append(self)
>
>
>
> Do the proponents of __close__ propose a way of prohibiting
> this behavior? Or do we continue to include complicated
> logic in the GC module to support it? I don't think anyone
> cares how this code behaves so long as it doesn't segfault.

I still think a rather nice solution would be to guarantee to call
__del__ (or __close__ or whatever) only once, as was discussed earlier:

http://mail.python.org/pipermail/python-dev/2005-August/055251.html

It solves all sorts of nasty problems with resurrection and cyclic GC
and it is the semantics you already get when using Jython and PyPy
(maybe IronPython too, I don't know how GC is handled in the CLR).

Now the implementation side of this is more messy, especially with
refcounting. You would need a way to store whether the object was
already finalized. I think you could steal one bit of the refcounting
field to store this information (and still have a very fast check for
whether the rest of the refcounting field is really zero, if the correct
bit is chosen).
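
(Illustrating the bookkeeping with plain integers -- only a sketch of the
idea, not CPython code:)

    FINALIZED = 1 << 31               # the hypothetical stolen bit

    def mark_finalized(refcnt):
        return refcnt | FINALIZED

    def was_finalized(refcnt):
        return refcnt & FINALIZED != 0

    def is_really_zero(refcnt):
        # fast check: everything except the stolen bit must be zero
        return refcnt & ~FINALIZED == 0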

Cheers,

Carl Friedrich Bolz


From tanzer at swing.co.at  Fri Sep 22 19:05:51 2006
From: tanzer at swing.co.at (Christian Tanzer)
Date: Fri, 22 Sep 2006 19:05:51 +0200
Subject: [Python-3000] Removing __var
In-Reply-To: Your message of "Fri, 22 Sep 2006 09:26:17 PDT."
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A0@hammer.office.bhtrader.com>
Message-ID: <E1GQoTH-0003Zp-F7@swing.co.at>


"Raymond Hettinger" <rhettinger at ewtllc.com> wrote:

> I propose dropping the __var private name mangling trick for
> double-underscores.
>
> It is rarely used; it smells like a hack; it complicates introspection
> tools; it's not beautiful; and it is not in line with Python's spirit of
> "we're all consenting adults".

It is useful in some situations, though. In particular, I use a
metaclass that sets `__super` to the right value. This wouldn't work
without name mangling.

-- 
Christian Tanzer                                    http://www.c-tanzer.at/


From fdrake at acm.org  Fri Sep 22 19:29:17 2006
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 22 Sep 2006 13:29:17 -0400
Subject: [Python-3000] Removing __var
In-Reply-To: <E1GQoTH-0003Zp-F7@swing.co.at>
References: <E1GQoTH-0003Zp-F7@swing.co.at>
Message-ID: <200609221329.17334.fdrake@acm.org>

On Friday 22 September 2006 13:05, Christian Tanzer wrote:
 > It is useful in some situations, though. In particular, I use a
 > metaclass that sets `__super` to the right value. This wouldn't work
 > without name mangling.

This also doesn't work if two classes in the inheritance hierarchy have the 
same __name__, if I understand how you're using this.  My guess is that 
you're using calls like

    def doSomething(self, arg):
        self.__super.doSomething(arg + 1)
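
backed by a metaclass roughly like the classic autosuper recipe (a sketch,
not necessarily Christian's actual code):

    class autosuper(type):
        def __init__(cls, name, bases, ns):
            super(autosuper, cls).__init__(name, bases, ns)
            # stored under the mangled name _<ClassName>__super;
            # this is what breaks if two classes share a __name__
            setattr(cls, '_%s__super' % name, super(cls))

    class A(object):
        __metaclass__ = autosuper
        def doSomething(self, arg):
            return arg

    class B(A):
        def doSomething(self, arg):
            return self.__super.doSomething(arg + 1)  # -> A.doSomething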


  -Fred

-- 
Fred L. Drake, Jr.   <fdrake at acm.org>

From jimjjewett at gmail.com  Fri Sep 22 19:52:17 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Fri, 22 Sep 2006 13:52:17 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <451416B4.1060201@gmx.de>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<451416B4.1060201@gmx.de>
Message-ID: <fb6fbf560609221052p3edcd120p4136b6ff4b7eb595@mail.gmail.com>

On 9/22/06, Carl Friedrich Bolz <cfbolz at gmx.de> wrote:
> I still think a rather nice solution would be to guarantee to call
> __del__ (or __close__ or whatever) only once, as was discussed earlier:

How does this help?

It doesn't say how to resolve cycles.  This cycle problem is the cause
of much implementation complexity and most user frustration (because
the method doesn't get called).

Once-only does prevent objects from usefully reviving them*selves*,
but it doesn't prevent them from creating a revived copy.  Since you
still have to start at the top of a tree, they can even reuse
otherwise-dead subobjects -- which keeps most of the rest of the
complexity.

And to be honest, I'm not sure you *can* remove the complexity, so
much as you can move it.  Enforcing no-revival-even-of-subobjects is
the same tricky maze in reverse.  Saying "We don't make any promises
regarding revival" just leads to inconsistency, and make the bugs
subtler.

The advantage of the __close__ semantics is that it greatly reduces
the number of unbreakable cycles; this still doesn't avoid corner
cases, but it simplifies the average case, and therefore the typical
user experience.

-jJ

From bob at redivi.com  Fri Sep 22 20:02:19 2006
From: bob at redivi.com (Bob Ippolito)
Date: Fri, 22 Sep 2006 11:02:19 -0700
Subject: [Python-3000] Removing __var
In-Reply-To: <200609221329.17334.fdrake@acm.org>
References: <E1GQoTH-0003Zp-F7@swing.co.at> <200609221329.17334.fdrake@acm.org>
Message-ID: <6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com>

On 9/22/06, Fred L. Drake, Jr. <fdrake at acm.org> wrote:
> On Friday 22 September 2006 13:05, Christian Tanzer wrote:
>  > It is useful in some situations, though. In particular, I use a
>  > metaclass that sets `__super` to the right value. This wouldn't work
>  > without name mangling.
>
> This also doesn't work if two classes in the inheritance hierarchy have the
> same __name__, if I understand how you're using this.  My guess is that
> you're using calls like
>
>     def doSomething(self, arg):
>         self.__super.doSomething(arg + 1)

In the one or two situations where it "is useful" you could always
write out what it would've done.

self._ThisClass__super.doSomething(arg + 1)

-bob

From theller at python.net  Fri Sep 22 20:19:52 2006
From: theller at python.net (Thomas Heller)
Date: Fri, 22 Sep 2006 20:19:52 +0200
Subject: [Python-3000] Removing __var
In-Reply-To: <6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com>
References: <E1GQoTH-0003Zp-F7@swing.co.at> <200609221329.17334.fdrake@acm.org>
	<6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com>
Message-ID: <ef19g8$c88$2@sea.gmane.org>

Bob Ippolito schrieb:
> On 9/22/06, Fred L. Drake, Jr. <fdrake at acm.org> wrote:
>> On Friday 22 September 2006 13:05, Christian Tanzer wrote:
>>  > It is useful in some situations, though. In particular, I use a
>>  > metaclass that sets `__super` to the right value. This wouldn't work
>>  > without name mangling.
>>
>> This also doesn't work if two classes in the inheritance hierarchy have the
>> same __name__, if I understand how you're using this.  My guess is that
>> you're using calls like
>>
>>     def doSomething(self, arg):
>>         self.__super.doSomething(arg + 1)
> 
> In the one or two situations where it "is useful" you could always
> write out what it would've done.
> 
> self._ThisClass__super.doSomething(arg + 1)

It is much more verbose, though.  The question is are you writing
this more often, or are you introspecting more often?

Thomas


From bob at redivi.com  Fri Sep 22 20:31:16 2006
From: bob at redivi.com (Bob Ippolito)
Date: Fri, 22 Sep 2006 11:31:16 -0700
Subject: [Python-3000] Removing __var
In-Reply-To: <ef19g8$c88$2@sea.gmane.org>
References: <E1GQoTH-0003Zp-F7@swing.co.at> <200609221329.17334.fdrake@acm.org>
	<6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com>
	<ef19g8$c88$2@sea.gmane.org>
Message-ID: <6a36e7290609221131v55d52401s59a74e55b8575376@mail.gmail.com>

On 9/22/06, Thomas Heller <theller at python.net> wrote:
> Bob Ippolito schrieb:
> > On 9/22/06, Fred L. Drake, Jr. <fdrake at acm.org> wrote:
> >> On Friday 22 September 2006 13:05, Christian Tanzer wrote:
> >>  > It is useful in some situations, though. In particular, I use a
> >>  > metaclass that sets `__super` to the right value. This wouldn't work
> >>  > without name mangling.
> >>
> >> This also doesn't work if two classes in the inheritance hierarchy have the
> >> same __name__, if I understand how you're using this.  My guess is that
> >> you're using calls like
> >>
> >>     def doSomething(self, arg):
> >>         self.__super.doSomething(arg + 1)
> >
> > In the one or two situations where it "is useful" you could always
> > write out what it would've done.
> >
> > self._ThisClass__super.doSomething(arg + 1)
>
> It is much more verbose, though.  The question is are you writing
> this more often, or are you introspecting more often?

The point is that legitimate __ usage is supposedly so rare that this
verbosity doesn't matter. If it's verbose, people definitely won't use
it until they need to, where right now people do it all the time cause
it's "private".

-bob

From brett at python.org  Fri Sep 22 21:06:33 2006
From: brett at python.org (Brett Cannon)
Date: Fri, 22 Sep 2006 12:06:33 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
Message-ID: <bbaeab100609221206p271d981cr7fc637c4b9fc1193@mail.gmail.com>

On 9/22/06, Raymond Hettinger <rhettinger at ewtllc.com> wrote:
>
> [Michael Chermside]
> > I don't seem to have gotten anyone on board with the bold
> > proposal to just rip __del__ out and tell people to learn
> > to use weakrefs.  But I'm hearing general agreement (at least
> > among those contributing to this thread) that it might be
> > wise to change the status quo.
>
> I'm on-board for just ripping out __del__.



Same here.  I have just been too busy with other stuff to make this thread a
priority, partially because I still remember when Tim proposed this and said
there was something slightly off with the way weakrefs worked for it to be
the perfect solution.

-Brett

From rasky at develer.com  Sat Sep 23 00:16:58 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sat, 23 Sep 2006 00:16:58 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
Message-ID: <008901c6de94$d2072ed0$4bbd2997@bagio>

Raymond Hettinger wrote:

> Is there anything vital that could be done with a __close__ method
> that can't already be done with a weakref callback?  We aren't going
> to need it.

It can't be done with the same cleanness and ease. It will require more
convoluted and complex code. It will require people to understand weakrefs in
the first place.

Did you actually read my posts where I have shown some legitimate use cases of
__del__ which can't be substituted with short and elegant enough code?

Giovanni Bajo


From cfbolz at gmx.de  Fri Sep 22 20:45:27 2006
From: cfbolz at gmx.de (Carl Friedrich Bolz)
Date: Fri, 22 Sep 2006 20:45:27 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <fb6fbf560609221052p3edcd120p4136b6ff4b7eb595@mail.gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	
	<451416B4.1060201@gmx.de>
	<fb6fbf560609221052p3edcd120p4136b6ff4b7eb595@mail.gmail.com>
Message-ID: <45142F47.4050600@gmx.de>

Jim Jewett wrote:
> On 9/22/06, Carl Friedrich Bolz <cfbolz at gmx.de> wrote:
>> I still think a rather nice solution would be to guarantee to call
>> __del__ (or __close__ or whatever) only once, as was discussed earlier:
>
> How does this help?

It helps by removing many corner cases in the GC that come from objects
reviving themselves (and putting themselves into a cycle, for example).
It makes reviving an object perfectly ok, since the strange things start
to happen when an object continuously revives itself again and again.

> It doesn't say how to resolve cycles.  This cycle problem is the cause
> of much implementation complexity and most user frustration (because
> the method doesn't get called).

But the above proposal is independent of the question of how cycles with
finalizers get resolved. We could still say that this happens in an
arbitrary order. My point is more that just allowing objects to be
finalized in arbitrary order does not solve the problem of objects
continuously reviving themselves.

> Once-only does prevent objects from usefully reviving them*selves*,
> but it doesn't prevent them from creating a revived copy.  Since you
> still have to start at the top of a tree, they can even reuse
> otherwise-dead subobjects -- which keeps most of the rest of the
> complexity.
>
> And to be honest, I'm not sure you *can* remove the complexity, so
> much as you can move it.  Enforcing no-revival-even-of-subobjects is
> the same tricky maze in reverse.  Saying "We don't make any promises
> regarding revival" just leads to inconsistency, and make the bugs
> subtler.
>
> The advantage of the __close__ semantics is that it greatly reduces
> the number of unbreakable cycles; this still doesn't avoid corner
> cases, but it simplifies the average case, and therefore the typical
> user experience.

See above.  Calling __del__ only once is an issue independent of how to
break cycles.

Cheers,

Carl Friedrich Bolz

From rhettinger at ewtllc.com  Sat Sep 23 01:24:48 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Fri, 22 Sep 2006 16:24:48 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <023701c6dc34$8a79dc50$a14c2597@bagio>
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
	<023701c6dc34$8a79dc50$a14c2597@bagio>
Message-ID: <451470C0.8060903@ewtllc.com>

Giovanni Bajo wrote:

>I don't use __del__ much. I use it only in leaf classes, where it surely can't
>be part of loops. In those rare cases, it's very useful to me. For instance, I
>have a small class which wraps an existing handle-based C API exported to
>Python. Something along the lines of:
>
>class Wrapper:
>    def __init__(self, *args):
>           self.handle = CAPI.init(*args)
>
>    def __del__(self, *args):
>            CAPI.close(self.handle)
>
>    def foo(self):
>            CAPI.foo(self.handle)
>
>The real class isn't much longer than this (really). How do you propose to
>write this same code without __del__?
>  
>
Use weakref and apply the usual idioms for the callbacks:

import weakref

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)
        # the callback closes the handle once the wrapper is collected
        self._wr = weakref.ref(self, lambda wr, h=self.handle: CAPI.close(h))

    def foo(self):
        CAPI.foo(self.handle)



Raymond


From aahz at pythoncraft.com  Sat Sep 23 01:56:02 2006
From: aahz at pythoncraft.com (Aahz)
Date: Fri, 22 Sep 2006 16:56:02 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <008901c6de94$d2072ed0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
Message-ID: <20060922235602.GA3427@panix.com>

On Sat, Sep 23, 2006, Giovanni Bajo wrote:
>
> Did you actually read my posts where I have shown some legitimate use
> cases of __del__ which can't be substituted with short and elegant
> enough code?

The question is whether those use cases are frequent enough -- especially
for less-than-wizard programmers -- to warrant keeping __del__ around.
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

"LL YR VWL R BLNG T S"  -- www.nancybuttons.com

From bob at redivi.com  Sat Sep 23 02:35:50 2006
From: bob at redivi.com (Bob Ippolito)
Date: Fri, 22 Sep 2006 17:35:50 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <20060922235602.GA3427@panix.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
Message-ID: <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>

On 9/22/06, Aahz <aahz at pythoncraft.com> wrote:
> On Sat, Sep 23, 2006, Giovanni Bajo wrote:
> >
> > Did you actually read my posts where I have shown some legitimate use
> > cases of __del__ which can't be substituted with short and elegant
> > enough code?
>
> The question is whether those use cases are frequent enough -- especially
> for less-than-wizard programmers -- to warrant keeping __del__ around.

I still haven't seen one that can't be done pretty trivially with a
weakref. Perhaps the solution is to make doing cleanup-by-weakref
easier or more obvious? Something like this maybe:

import weakref

class GarbageDisposal:
    def __init__(self):
        self.refs = set()

    def __call__(self, object, func, *args, **kw):
        def cleanup(ref):
            self.refs.remove(ref)
            func(*args, **kw)
        self.refs.add(weakref.ref(object, cleanup))

on_cleanup = GarbageDisposal()

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)
        on_cleanup(self, CAPI.close, self.handle)

    def foo(self):
        CAPI.foo(self.handle)

-bob

From greg.ewing at canterbury.ac.nz  Sat Sep 23 04:08:11 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 23 Sep 2006 14:08:11 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
Message-ID: <4514970B.7070806@canterbury.ac.nz>

Chermside, Michael wrote:
> I don't seem to have gotten anyone on board with the bold proposal
> to just rip __del__ out and tell people to learn to use weakrefs.

Well, I'd be in favour of it. I've argued something
similar in the past, without much success then either.

--
Greg

From greg.ewing at canterbury.ac.nz  Sat Sep 23 04:20:53 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sat, 23 Sep 2006 14:20:53 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <bbaeab100609221206p271d981cr7fc637c4b9fc1193@mail.gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<bbaeab100609221206p271d981cr7fc637c4b9fc1193@mail.gmail.com>
Message-ID: <45149A05.4030409@canterbury.ac.nz>

Brett Cannon wrote:
> I still remember when Tim proposed 
> this and said there was something slightly off with the way weakrefs 
> worked for it to be the perfect solution.

If that's true, it might be better to concentrate on
fixing this problem so that weakrefs can be used,
rather than trying to patch up __del__.

--
Greg

From rasky at develer.com  Sat Sep 23 10:01:32 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sat, 23 Sep 2006 10:01:32 +0200
Subject: [Python-3000] Removing __del__
References: <20060919053609.vp8duwukq7sw4w48@login.werra.lunarpages.com>
	<023701c6dc34$8a79dc50$a14c2597@bagio>
	<451470C0.8060903@ewtllc.com>
Message-ID: <02ac01c6dee6$7bbee430$4bbd2997@bagio>

Raymond Hettinger wrote:

>> I don't use __del__ much. I use it only in leaf classes, where it
>> surely can't be part of loops. In those rare cases, it's very useful
>> to me. For instance, I have a small class which wraps an existing
>> handle-based C API exported to Python. Something along the lines of:
>>
>> class Wrapper:
>>    def __init__(self, *args):
>>           self.handle = CAPI.init(*args)
>>
>>    def __del__(self, *args):
>>            CAPI.close(self.handle)
>>
>>    def foo(self):
>>            CAPI.foo(self.handle)
>>
>> The real class isn't much longer than this (really). How do you
>> propose to write this same code without __del__?
>>
>>
> Use weakref and apply the usual idioms for the callbacks:
>
> class Wrapper:
>     def __init__(self, *args):
>         self.handle = CAPI.init(*args)
>         self._wr = weakref.ref(self,
>                                lambda wr, h=self.handle: CAPI.close(h))
>
>     def foo(self):
>         CAPI.foo(self.handle)

What happens if self.handle changes? Or if it's closed, so that the weakref
should be destroyed? You will have to bookkeep _wr everywhere across the
class code.

You're proposing to remove a simple method that is easy to use and explain, but
that can cause complex problems in some cases (cycles). The alternative is a
complex finalization system, which uses weakrefs, delayed function calls, and
must be written smartly to avoid keeping references to "self". I don't see this
as progress. On the other hand, __close__ is easy to understand, maintain,
and would solve one problem of __del__.

I think what Python 2.x really needs is a better way (library + interpreter
support) to debug cycles (both collectable and uncollectable, as the former
can be just as bad as the latter in real-time applications). Removing
__del__ just complicates real-world use cases without providing a
comprehensive solution to the problem.

Giovanni Bajo


From rasky at develer.com  Sat Sep 23 11:13:41 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sat, 23 Sep 2006 11:13:41 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com><B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com><008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
Message-ID: <035f01c6def0$900594c0$4bbd2997@bagio>

Aahz wrote:

>> Did you actually read my posts where I have shown some legitimate use
>> cases of __del__ which can't be substituted with short and elegant
>> enough code?
>
> The question is whether those use cases are frequent enough --
> especially for less-than-wizard programmers -- to warrant keeping
> __del__ around.

What I am basically against is removing an easy syntax which can
have problematic side effects if you are not adult enough, in favor of a
complicated library workaround which requires deeper knowledge of Python
(weakrefs, lambdas, early binding of default arguments, just to name three),
and can cause side effects just as bad if you are not adult enough. Where's the
trade-off?

On the other hand, __close__ (an out-of-order, recallable __del__) fixes some
issues of __del__; it is easy to teach and understand, and it is easy to write.

And, if we could (optionally) raise a RuntimeError as soon as an object with
__del__ enters a loop, would your opinion about it be different?

Giovanni Bajo


From rasky at develer.com  Sat Sep 23 11:18:48 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sat, 23 Sep 2006 11:18:48 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com><B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com><008901c6de94$d2072ed0$4bbd2997@bagio><20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
Message-ID: <039d01c6def1$46df1ef0$4bbd2997@bagio>

Bob Ippolito wrote:

> import weakref
> 
> class GarbageDisposal:
>     def __init__(self):
>         self.refs = set()
> 
>     def __call__(self, object, func, *args, **kw):
>         def cleanup(ref):
>             self.refs.remove(ref)
>             func(*args, **kw)
>         self.refs.add(weakref.ref(object, cleanup))
> 
> on_cleanup = GarbageDisposal()
> 
> class Wrapper:
>     def __init__(self, *args):
>         self.handle = CAPI.init(*args)
>         on_cleanup(self, CAPI.close, self.handle)
> 
>     def foo(self):
>         CAPI.foo(self.handle)

Try with this:

class Wrapper2:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)

    def foo(self):
        CAPI.foo(self.handle)

    def restart(self):
        self.handle = CAPI.restart(self.handle)

    def close(self):
        CAPI.close(self.handle)
        self.handle = None

    def __del__(self):
        if self.handle is not None:
            self.close()


Giovanni Bajo


From bob at redivi.com  Sat Sep 23 11:22:07 2006
From: bob at redivi.com (Bob Ippolito)
Date: Sat, 23 Sep 2006 02:22:07 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <039d01c6def1$46df1ef0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
Message-ID: <6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>

On 9/23/06, Giovanni Bajo <rasky at develer.com> wrote:
> Bob Ippolito wrote:
>
> > import weakref
> >
> > class GarbageDisposal:
> >     def __init__(self):
> >         self.refs = set()
> >
> >     def __call__(self, object, func, *args, **kw):
> >         def cleanup(ref):
> >             self.refs.remove(ref)
> >             func(*args, **kw)
> >         self.refs.add(weakref.ref(object, cleanup))
> >
> > on_cleanup = GarbageDisposal()
> >
> > class Wrapper:
> >     def __init__(self, *args):
> >         self.handle = CAPI.init(*args)
> >         on_cleanup(self, CAPI.close, self.handle)
> >
> >     def foo(self):
> >         CAPI.foo(self.handle)
>
> Try with this:
>
> class Wrapper2:
>    def __init__(self, *args):
>         self.handle = CAPI.init(*args)
>
>    def foo(self):
>         CAPI.foo(self.handle)
>
>    def restart(self):
>         self.handle = CAPI.restart(self.handle)
>
>    def close(self):
>         CAPI.close(self.handle)
>         self.handle = None
>
>    def __del__(self):
>          if self.handle is not None:
>                 self.close()

I've never seen an API that works like that. Have you?

-bob

From rasky at develer.com  Sat Sep 23 11:39:20 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sat, 23 Sep 2006 11:39:20 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
Message-ID: <03bb01c6def4$257b6c70$4bbd2997@bagio>

Bob Ippolito wrote:

>> class Wrapper2:
>>    def __init__(self, *args):
>>         self.handle = CAPI.init(*args)
>>
>>    def foo(self):
>>         CAPI.foo(self.handle)
>>
>>    def restart(self):
>>         self.handle = CAPI.restart(self.handle)
>>
>>    def close(self):
>>         CAPI.close(self.handle)
>>         self.handle = None
>>
>>    def __del__(self):
>>          if self.handle is not None:
>>                 self.close()
>
> I've never seen an API that works like that. Have you?

The class above shows a case where:

1) There's a way to destroy the handle BEFORE __del__ is called, which would
require killing the weakref / deregistering the finalization hook. I believe
you agree that this is pretty common (I have around 10 uses of this pattern,
__del__ with a separate explicit closure method, in one Python codebase of
mine).

2) The objects required in the destructor can be mutated / changed during the
lifetime of the instance. For instance, a class that wraps Win32
FindFirstFile/FindNextFile and supports transparent directory recursion needs
something similar. Or CreateToolhelp32Snapshot() with the Module32First/Next
stuff. Another example is a class which creates named temporary files and needs
to remove them on finalization. It might need to create several different
temporary files (say, self.handle is the filename in that case)[1], so the
filename needed in the destructor changes during the lifetime of the instance.

#2 is admittedly more convoluted (and probably rarer) than #1, but it's
still a reasonable use case which you really can't easily do with a simple
finalization API like the one you were proposing. Python is Turing-complete
without __del__, but in some cases the alternatives are *really* worse.

Giovanni Bajo

[1] tempfile.NamedTemporaryFile can't always be used because it does not
guarantee that the file can be reopened; for instance, zipfile.ZipFile() wants
a filename, so if you want to create a temporary ZipFile you can't use
tempfile.NamedTemporaryFile.


From martin at v.loewis.de  Sat Sep 23 13:33:14 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 23 Sep 2006 13:33:14 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>	<20060920083244.0817.JCARLSON@uci.edu>
	<aac2c7cb0609201047g399d095ck7f64a367764b520f@mail.gmail.com>
Message-ID: <45151B7A.50000@v.loewis.de>

Adam Olsen schrieb:
> Just a minor nit.  I doubt we could accept UCS-2, we'd want UTF-16
> instead, with all the variable-width goodness that brings in.

Sure we could; we can currently.

> Or maybe not so minor.  Old versions of windows used UCS-2, new
> versions use UTF-16.  The former should get errors if too high of a
> character is used, the latter will need conversion if we're not using
> UTF-16.

Define "used". Surrogate pairs work well in the NTFS of Windows NT 3.1;
no errors are reported.

Regards,
Martin

From martin at v.loewis.de  Sat Sep 23 13:38:02 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 23 Sep 2006 13:38:02 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>
	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>
Message-ID: <45151C9A.3090908@v.loewis.de>

Adam Olsen schrieb:
> As far as I can tell, CPython on windows uses UTF-16 with code units.
> Perhaps not intentionally, but by default (not throwing an error on
> surrogates).

It's intentionally; that's what PEP 261 specifies.

Regards,
Martin

From martin at v.loewis.de  Sat Sep 23 13:50:36 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 23 Sep 2006 13:50:36 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <015c01c6dd68$bb6bcfa0$e303030a@trilan>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com><ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com><aac2c7cb0609201602l7e4dca43i330c369c0af265@mail.gmail.com><ca471dc20609201608r6779a352te79325e8f058a586@mail.gmail.com><aac2c7cb0609201620v297d61e1je2d5709418b9c2fa@mail.gmail.com>	<ca471dc20609201855t711c22bev500baa7eb8371e09@mail.gmail.com>
	<015c01c6dd68$bb6bcfa0$e303030a@trilan>
Message-ID: <45151F8C.2070304@v.loewis.de>

Giovanni Bajo schrieb:
> Is there a design document explaining the rationale of unicode type, the
> status quo?

There is a document documenting the status quo: the source code.
Contributors to this thread (or, for that matter, to this mailing
list) should really familiarize themselves with the source code before
posting - nobody is willing to answer questions that can be answered just
by looking at the source code.

Now, there might be questions like "why is this or that done that way?"
People are more open to answer questions like that if the poster
demonstrates that he knows what the way is, and can suggest theories
as to why it might be the way it is.

> Any time this subject is raised on the mailing list, the net
> result is "you guys don't understand unicode". Well, let us know what is
> good and what is bad of the current unicode type; what is by design and what
> is an implementation detail; what you want to absolutely keep, and what you
> want to absolutely change. I am *really* confused about the status quo of
> the unicode type (which is why I keep myself out of technical discussions on
> the matter of course). Is there any desire to let people understand and join
> the discussion?

It's clear that there should be only a single character string type, and
that should be close to the current Unicode type, in semantics and
implementation.

Regards,
Martin

From martin at v.loewis.de  Sat Sep 23 14:01:56 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 23 Sep 2006 14:01:56 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45126E76.9020600@nekomancer.net>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<45126E76.9020600@nekomancer.net>
Message-ID: <45152234.1090303@v.loewis.de>

Gábor Farkas schrieb:
> while i understand the constraints, i think it's not a good decision to 
> leave this to be implementation-dependent.
> 
> the strings seem to me as such a basic functionality, that its 
> behaviour should not depend on the platform.
> 
> for example, how is an application developer then supposed to write 
> their applications?

An application developer should always know what the target platforms
are. For example, does the code need to work with IronPython or not?
Python is not aiming at 100% portability at all costs. Many aspects
are platform dependent, and while this has complicated some
applications, it has simplified others (which could make use of
platform details that otherwise would not have been exposed to the
Python programmer).

> should he write his own slicing/whatever functions to get consistent 
> behaviour on linux/windows?

Depends on the application, and the specific slicing operations.
If the slicing appears in the processing of .ini files (say),
no platform-dependent slicing should be necessary.

> i think this is not just a 'theoretical' issue. it's a very practical 
> issue. the only reason why it does not seem to be important is because 
> currently not many of the non-16-bit unicode characters are used.

No, there is a deeper reason. A typical program only performs substring
operations on selected boundaries (such as whitespace, or punctuation).
Those are typically in the BMP (not sure whether *any* punctuation
is outside the BMP).

> but the same way i could say, that because most of the unix-world is 
> utf-8, for those pythons the best way is to handle it internally as 
> utf-8, couldn't i?

I think you live in a free country: you can certainly say that.
I think you would be wrong. The common on-disk/on-wire representation
of text should not influence the design of an in-memory representation.

> it simply seems to me strange to make compromises that make the life of 
> the cpython-users harder, just to make the life for the 
> jython/ironpython developers (i mean the 'creators') easier.

Guido didn't say that the life of the CPython user needs to be hard.
He said it will be implementation-dependent, referring to Jython
and IronPython. Whether or not CPython uses a consistent representation
or consistent python-level experience across platforms is a different
issue. CPython could behave absolutely consistently, and use four-byte
Unicode on all systems, and the length of a non-BMP string would
still be implementation-defined.

Regards,
Martin

From martin at v.loewis.de  Sat Sep 23 14:09:00 2006
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Sat, 23 Sep 2006 14:09:00 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4511E644.2030306@blueyonder.co.uk>
References: <aac2c7cb0609200555y418c4c14n2c7e22e53235a468@mail.gmail.com>	<ca471dc20609200940q2f0f1e68vb4b6c943d180cc17@mail.gmail.com>	<aac2c7cb0609201120i4c4da950vc0ae7072513c2e7c@mail.gmail.com>	<bbaeab100609201130q793247e4p676d9e00b15f9038@mail.gmail.com>
	<4511E644.2030306@blueyonder.co.uk>
Message-ID: <451523DC.2050901@v.loewis.de>

David Hopwood schrieb:
>> Assuming my Unicode lingo is right and code point represents a
>> letter/character/digraph/whatever, then it will be a code point.  Doing one
>> of my rare channels of Guido, I *really* doubt he wants to expose the
>> technical details of Unicode to the point of having people need to realize
>> that UTF-8 takes two bytes to represent "ö".
> 
> The argument used here is not valid. People do need to realize that *all*
> Unicode encodings are variable-length, in the sense that abstract characters
> can be represented by multiple code points.

Brett did not make such an argument. He made an argument that users
should not need to care that "ö" in UTF-8 is two bytes. And I agree:
users should not have to worry about this wrt. internal representation.

> For example, "ö" can be represented either as the precomposed character U+00F6,
> or as "o" followed by a combining diaeresis (U+006F U+0308). Programs must
> avoid splitting sequences of code points that represent a single abstract
> character.

Why is that? Many programs never encounter cases where this would
matter, so why do such programs have to operate correctly if that case
is encountered?

> It simply is not possible to do correct string processing in Unicode that
> will "work the way [programmers] are used to when compared to working in ASCII".

Brett didn't say that this was a goal.

> Should we nevertheless try to avoid making the use of Unicode strings
> unnecessarily difficult for people who have minimal knowledge of Unicode?
> Absolutely, but not at the expense of making basic operations on strings
> asymptotically less efficient. O(1) indexing and slicing is a basic
> requirement, even if it has to be done using code units.

It's not possible to implement slicing in constant time, unless string
views are introduced. Currently, slicing takes time linear with the
length of the result string.

Regards,
Martin


From nas at arctrix.com  Sat Sep 23 17:45:33 2006
From: nas at arctrix.com (Neil Schemenauer)
Date: Sat, 23 Sep 2006 15:45:33 +0000 (UTC)
Subject: [Python-3000] Removing __var
References: <E1GQoTH-0003Zp-F7@swing.co.at> <200609221329.17334.fdrake@acm.org>
	<6a36e7290609221102r757fa9e2r7bdf32b2e31f6eb1@mail.gmail.com>
	<ef19g8$c88$2@sea.gmane.org>
	<6a36e7290609221131v55d52401s59a74e55b8575376@mail.gmail.com>
Message-ID: <ef3kqt$8jj$1@sea.gmane.org>

Bob Ippolito <bob at redivi.com> wrote:
> The point is that legitimate __ usage is supposedly so rare that this
> verbosity doesn't matter. If it's verbose, people definitely won't use
> it until they need to, where right now people do it all the time cause
> it's "private".

It's very rare, in my experience.  I vote to rip it out.

  Neil


From qrczak at knm.org.pl  Sat Sep 23 18:34:20 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Sat, 23 Sep 2006 18:34:20 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <03bb01c6def4$257b6c70$4bbd2997@bagio> (Giovanni Bajo's message
	of "Sat, 23 Sep 2006 11:39:20 +0200")
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
Message-ID: <8764fesfsj.fsf@qrnik.zagroda>

"Giovanni Bajo" <rasky at develer.com> writes:

> 1) There's a way to destroy the handle BEFORE __del__ is called,
> which would require killing the weakref / deregistering the
> finalization hook.

Weakrefs should have a method which runs their callback and
unregisters them.
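
Something along these lines, for instance (a sketch; the names are made up,
and the caller must keep the Finalizer object alive somewhere, e.g. in a
module-level set):

    import weakref

    class Finalizer(object):
        # a weakref whose callback can also be run early, exactly once
        def __init__(self, obj, func, *args):
            self._pending = (func, args)
            self._ref = weakref.ref(obj, self._run)

        def _run(self, ref=None):
            if self._pending is not None:
                (func, args), self._pending = self._pending, None
                func(*args)

        def run_now(self):
            # run the callback early and effectively unregister it
            self._run()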

> 2) The objects required in the destructor can be mutated / changed
> during the lifetime of the instance. For instance, a class that
> wraps Win32 FindFirstFile/FindNextFile and supports transparent
> directory recursion needs something similar.

Listing files with transparent directory recursion can be implemented
in terms of listing files of a given directory, such that a finalizer
is only used with the low level object.

> Another example is a class which creates named temporary files
> and needs to remove them on finalization. It might need to create
> several different temporary files (say, self.handle is the filename
> in that case)[1], so the filename needed in the destructor changes
> during the lifetime of the instance.

Again: move the finalizer to a single temporary-file object, and refer
to that object instead of a raw handle.
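
E.g. (a sketch, reusing the registry idiom shown earlier in the thread):

    import os, tempfile, weakref

    _cleanups = set()

    class OwnedTempFile(object):
        # owns exactly one named temporary file; deletes it when collected
        def __init__(self):
            fd, self.name = tempfile.mkstemp()
            os.close(fd)
            def cleanup(ref, name=self.name):
                _cleanups.discard(ref)
                os.remove(name)
            _cleanups.add(weakref.ref(self, cleanup))

The outer class then just rebinds an attribute to a fresh OwnedTempFile
whenever it rotates files; each file carries its own finalizer.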

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From bob at redivi.com  Sat Sep 23 19:18:25 2006
From: bob at redivi.com (Bob Ippolito)
Date: Sat, 23 Sep 2006 10:18:25 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <03bb01c6def4$257b6c70$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
Message-ID: <6a36e7290609231018m65bed7edq116ab807aa5cf6f5@mail.gmail.com>

On 9/23/06, Giovanni Bajo <rasky at develer.com> wrote:
> Bob Ippolito wrote:
>
> >> class Wrapper2:
> >>    def __init__(self, *args):
> >>         self.handle = CAPI.init(*args)
> >>
> >>    def foo(self):
> >>         CAPI.foo(self.handle)
> >>
> >>    def restart(self):
> >>         self.handle = CAPI.restart(self.handle)
> >>
> >>    def close(self):
> >>         CAPI.close(self.handle)
> >>         self.handle = None
> >>
> >>    def __del__(self):
> >>          if self.handle is not None:
> >>                 self.close()
> >
> > I've never seen an API that works like that. Have you?
>
> The class above shows a case where:
>
> 1) There's a way to destroy the handle BEFORE __del__ is called, which would
> require killing the weakref / deregistering the finalization hook. I believe
> you agree that this is pretty common (I have around 10 uses of this pattern,
> __del__ with a separate explicit closure method, in one Python codebase of
> mine).

Easy enough, that would be a second function and the dict would change a bit.

> 2) The objects required in the destructor can be mutated / changed during the
> lifetime of the instance. For instance, a class that wraps Win32
> FindFirstFile/FindNextFile and supports transparent directory recursion needs
> something similar. Or CreateToolhelp32Snapshot() with the Module32First/Next
> stuff. Another example is a class which creates named temporary files and needs
> to remove them on finalization. It might need to create several different
> temporary files (say, self.handle is the filename in that case)[1], so the
> filename needed in the destructor changes during the lifetime of the instance.
>
> #2 is admittedly more convoluted (and probably rarer) than #1, but it's
> still a reasonable use case which you really can't easily do with a simple
> finalization API like the one you were proposing. Python is Turing-complete
> without __del__, but in some cases the alternatives are *really* worse.

You can of course easily do this with a simple finalization API.
Supporting this simply requires that multiple cleanup functions be
allowed per object.
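
For instance, the earlier GarbageDisposal sketch could keep a list of hooks
per object (still only a sketch; it assumes the tracked objects are
hashable):

    import weakref

    class GarbageDisposal:
        def __init__(self):
            self.hooks = {}           # weakref -> list of (func, args, kw)

        def __call__(self, obj, func, *args, **kw):
            # weakrefs to the same object hash and compare alike, so all
            # hooks for one object end up in a single list
            def run_all(ref):
                for f, a, k in self.hooks.pop(ref, ()):
                    f(*a, **k)
            ref = weakref.ref(obj, run_all)
            self.hooks.setdefault(ref, []).append((func, args, kw))

    on_cleanup = GarbageDisposal()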

-bob

From jcarlson at uci.edu  Sat Sep 23 20:03:43 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sat, 23 Sep 2006 11:03:43 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <451523DC.2050901@v.loewis.de>
References: <4511E644.2030306@blueyonder.co.uk> <451523DC.2050901@v.loewis.de>
Message-ID: <20060923104310.0863.JCARLSON@uci.edu>


"Martin v. L?wis" <martin at v.loewis.de> wrote:
> David Hopwood schrieb:
[snip]
> > Should we nevertheless try to avoid making the use of Unicode strings
> > unnecessarily difficult for people who have minimal knowledge of Unicode?
> > Absolutely, but not at the expense of making basic operations on strings
> > asymptotically less efficient. O(1) indexing and slicing is a basic
> > requirement, even if it has to be done using code units.
> 
> It's not possible to implement slicing in constant time, unless string
> views are introduced. Currently, slicing takes time linear with the
> length of the result string.

I believe he was referring to discovering the memory address where
slicing should begin.  In the case of Latin-1, UCS-2, or UCS-4, given a
starting address and some position i, it is trivial to discover the
memory position of character i.  In the case of UTF-8, given a starting
address and some position i, one needs to somewhat parse the UTF-8
representation to discover the memory position of character i.
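
In code, the difference is roughly this (a sketch over byte strings,
assuming well-formed UTF-8):

    def fixed_width_offset(i, width):
        # Latin-1 / UCS-2 / UCS-4: one multiplication, O(1)
        return i * width

    def utf8_offset(data, i):
        # UTF-8: walk the lead bytes up to character i, O(i)
        pos = 0
        for _ in xrange(i):
            lead = ord(data[pos])
            if lead < 0x80:
                pos += 1
            elif lead < 0xE0:
                pos += 2
            elif lead < 0xF0:
                pos += 3
            else:
                pos += 4
        return pos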


For me, having recently remembered what was in a unicode string, and
verifying it by checking the source, the question in my mind is whether
we want to stick with the same 2-representation implementation (default
encoding and UTF-16 or UCS-4 depending on build), or go with more or
fewer representations.

We can reduce memory consumption by using a single representation,
whether it be constant or variable based on content, though in some
cases (utf-16, ucs-4) we would lose the 'native' single-segment char (C
char) buffer interface.

Using multiple representations, and choosing those representations
carefully based on platform (always keep utf-8 as one of the
representations on linux, always keep utf-16 as one of the
representations in Windows), we may be able to increase platform API
calling speed, if such is desirable.

After re-reading the source, and thinking a bit more, about my only
real concern is memory use of Python 3.x.  The current implementation
works, so I'm +1 on keeping it "as is", but I'm also +0 on some
implementation that would reduce memory use (with limited, if any
slowdown) for as many platforms as possible, not any higher because
changing the underlying implementation would be a PITA.


 - Josiah


From martin at v.loewis.de  Sat Sep 23 21:17:04 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 23 Sep 2006 21:17:04 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060923104310.0863.JCARLSON@uci.edu>
References: <4511E644.2030306@blueyonder.co.uk> <451523DC.2050901@v.loewis.de>
	<20060923104310.0863.JCARLSON@uci.edu>
Message-ID: <45158830.8020908@v.loewis.de>

Josiah Carlson schrieb:
> For me, having recently remembered what was in a unicode string, and
> verifying it by checking the source, the question in my mind is whether
> we want to stick with the same 2-representation implementation (default
> encoding and UTF-16 or UCS-4 depending on build), or go with more or
> fewer representations.

I would personally like to see a Python API that operates on code
points, with support for 17 planes. I also think that efficient indexing
is important.

> We can reduce memory consumption by using a single representation,
> whether it be constant or variable based on content, though in some
> cases (utf-16, ucs-4) we would lose the 'native' single-segment char (C
> char) buffer interface.

I don't think reducing memory consumption is that important, for current
hardware. Java and .NET have demonstrated that you can do "real"
applications with that approach.

There are trade-offs, of course. I personally think the best trade-off
would be to have a two-byte representation, along with a flag telling
whether there are any surrogate pairs in the string. Indexing and
length would be constant-time if there are no surrogates, and linear
time if there are.
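
A sketch of that fast-path/slow-path split, over a sequence of 16-bit code
units:

    def code_point_at(units, i, has_surrogates):
        if not has_surrogates:
            return units[i]                    # O(1) fast path
        k = 0                                  # code-unit position
        for _ in xrange(i):                    # O(n) slow path
            if 0xD800 <= units[k] < 0xDC00:    # high surrogate
                k += 2                         # skip the whole pair
            else:
                k += 1
        u = units[k]
        if 0xD800 <= u < 0xDC00:
            return (0x10000 + ((u - 0xD800) << 10)
                    + (units[k + 1] - 0xDC00))
        return u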

> After re-reading the source, and thinking a bit more, about my only
> real concern is memory use of Python 3.x.  The current implementation
> works, so I'm +1 on keeping it "as is", but I'm also +0 on some
> implementation that would reduce memory use (with limited, if any
> slowdown) for as many platforms as possible, not any higher because
> changing the underlying implementation would be a PITA.

I think supporting multiple representations at run-time would really
be terrible. Any API of the "give me the data" kind would either have
to expose the choice of representations, or perform a copy. Either
alternative would produce many programming errors in extension modules.

Regards,
Martin

From rasky at develer.com  Sun Sep 24 02:04:36 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sun, 24 Sep 2006 02:04:36 +0200
Subject: [Python-3000] __close__ method
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
Message-ID: <07f701c6df6d$05b47660$4bbd2997@bagio>

Michael,

many thanks for your interesting mail, which pointed out the outcome of the
previous thread. Let me try to answer some questions of yours about
__close__.

> But I'm hearing general agreement (at least among those contributing
> to this thread) that it might be wise to change the status quo.

Status quo of __del__:

Pros:
- Easy syntax: very simple to use in easy situations.
- Easy semantics: familiar to beginners (similarity with other programming
languages), and being the "opposite" of __init__ makes it easy to teach.

Cons:
- Makes reference loops uncollectable -> people learn fast to avoid it in most
classes
- Allows resurrection, which is a headache for Python developers


> The two kinds of solutions I'm hearing are (1) those that are based
> around making a helper object that gets stored as an attribute in
> the object, or a list of weakrefs, or something like that, and (2)
> the __close__ proposal (or perhaps keep the name __del__ but change
> the semantics).
>
> The difficulties with (1) that have been acknowledged so far are
> that the way you code things becomes somewhat less obvious, and
> that there is the possibility of accidentally creating immortal
> objects through reference loops.

Exactly. To code these finalizers correctly, you need to be much more
Python-savvy than you need to be to use __del__, because you need to
understand and somehow master:

- weakrefs
- early binding of default arguments of functions

which are not exactly the two brightest areas of Python.

[ (2) the __close__ proposal ]
> I would like to hear someone address the weaknesses of (2).
> The first I know of is that the code in your __close__ method (or
> __del__) must assume that it might have been in a reference loop
> which was broken in some arbitrary place. As a result, it cannot
> assume that all references it holds are still valid. To avoid
> crashing the system, we'd probably have to set the broken
> references to None (is that the best choice?), but can people
> really write code that has to run assuming that its references
> might be invalid?

I might be wrong, but given the constraint that __close__ could be called
multiple times for the same object, I don't see how this situation might
appear. The cyclic GC could:

1) call __close__ on the instances *BEFORE* dropping the references. The code
in __close__ could break the cycle itself.
2) only after that, assume that __close__ did not dispose anything related to
the loop itself, and thus drop a random reference in the chain. This would
cause other calls to __close__ on the instances, which should result in
basically no-ops since they have already been executed.

BTW: would it be possible to "nullify" the __close__ method after it has been
executed once somehow, so that it won't get executed twice on the same
instance? A single bit in the instance (with the meaning of "already closed")
should be sufficient. If this is possible, then the above algorithm is easier
to implement, and it also makes __close__ methods easier to implement.
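
In pure Python, the effect can be sketched with nothing more than an
instance flag (names hypothetical):

    class MyClass3(object):
        _already_closed = False

        def __close__(self):
            if self._already_closed:
                return                 # second and later calls: no-ops
            self._already_closed = True
            self.flush()
            self.close()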

> A second problem I know of is, what if the code stores a reference
> to self someplace? The ability for __del__ methods to resurrect
> the object being finalized is one of the major sources of
> complexity in the GC module, and changing the semantics to
> __close__ doesn't fix this.

I don't think __close__ can solve this problem, in fact. I don't specifically
consider it a weakness of __close__, strictly speaking, though.


> -------- examples only below this line --------
>
> class MyClass2(object):
>     def __init__(self, resource1_name, resource2_name):
>         self.resource1 = acquire_resource(resource1_name)
>         self.resource2 = acquire_resource(resource2_name)
>     def flush(self):
>         self.resource1.flush()
>         self.resource2.flush()
>         if hasattr(self, 'next'):
>             self.next.flush()
>     def close(self):
>         self.resource1.release()
>         self.resource2.release()
>     def __close__(self):
>         self.flush()
>         self.close()
>
> x = MyClass2('db1', 'db2')
> y = MyClass2('db3', 'db4')
> x.next = y
> y.next = x
>
> This version will encounter a problem. When the GC sees
> the x <--> y loop it will break it somewhere... without
> loss of generality, let us say it breaks the y -> x link
> by setting y.next to None. Now y will be freed, so
> __close__ will be called. __close__ will invoke self.flush()
> which will then try to invoke self.next.flush(). But
> self.next is None, so we'll get an exception and never
> make it to invoking self.close().

With my algorithm, the following things will happen:

0) I assume that the resources can be flushed() even after having been
released() without causing weird exceptions... Otherwise the code should be
more defensive and delete the references to the resources after disposal (a
defensive variant is sketched after the walkthrough).
1) The GC first calls __close__ on either instance (say x). This flushes x
(invoking y.flush() via x.next) and releases x's resources; x is marked as
"already closed".
2) The GC then calls __close__ on y. This invokes x.flush() (via y.next) and
releases y's resources. x.flush() either has no side effects, or is
defensively coded against resource1/resource2 being None (since x's resources
were already disposed of in step 1).
3) The loop was not broken, so the GC drops a random reference. Let's say it
breaks the y -> x link. This causes x to be disposed. x is marked as "already
closed", so __close__ is not invoked. During disposal, the reference to y held
in x.next is dropped.
4) y is disposed. It's marked as "already closed", so __close__ is not invoked.
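
The defensive variant mentioned in step 0 could look roughly like this
(reusing stand-ins for the resources; __close__ remains a hypothetical
hook):

class _Res(object):               # stand-in resource
    def flush(self): pass
    def release(self): pass

def acquire_resource(name):
    return _Res()

class MyClass3(object):
    def __init__(self, n1, n2):
        self.resource1 = acquire_resource(n1)
        self.resource2 = acquire_resource(n2)
        self.next = None
    def flush(self):
        for r in (self.resource1, self.resource2):
            if r is not None:
                r.flush()
        if self.next is not None:
            self.next.flush()
    def __close__(self):
        self.flush()
        for r in (self.resource1, self.resource2):
            if r is not None:
                r.release()
        # drop the references so later flush()/__close__ calls are no-ops
        self.resource1 = self.resource2 = None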



> ------
>
> The other problem I discussed is illustrated by the following
> malicious code:
>
> evil_list = []
>
> class MyEvilClass(object):
>     def __close__(self):
>         evil_list.append(self)
>
> Do the proponents of __close__ propose a way of prohibiting
> this behavior? Or do we continue to include complicated
> logic in the GC module to support it? I don't think anyone
> cares how this code behaves so long as it doesn't segfault.

I can see how this can confuse the GC, but I really don't know the details. I
don't have any proposal for how to avoid this situation.

Giovanni Bajo


From jimjjewett at gmail.com  Sun Sep 24 02:38:40 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Sat, 23 Sep 2006 20:38:40 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
Message-ID: <fb6fbf560609231738u486edcb9m83c7e152ec7544fb@mail.gmail.com>

On 9/22/06, Bob Ippolito <bob at redivi.com> wrote:

> I still haven't seen one that can't be done pretty trivially
> with a weakref.  Perhaps the solution is to make
> doing cleanup-by-weakref easier or more obvious?

Possibly, but I've tried, and *I* couldn't come up with any way to use
them that was

(1) generic enough to put in a module, rather than a recipe
(2) easy enough to still be an improvement, and
(3) correct.

>     def __call__(self, object, func, *args, **kw):
>         def cleanup(ref):
>             self.refs.remove(ref)
>             func(*args, **kw)
>         self.refs.add(weakref.ref(object, cleanup))

Now remember something like Michael Chermside's "simplest" example,
where you need to flush before closing.

The obvious way is to pass self.close, but it doesn't actually work.
Because it is a bound method, it silently makes the object effectively
immortal.

The "correct" way is to write another function which is basically an
awkward copy of self.close.  At the moment, I can't think of any
*good* way to ensure access to self.resource1 and self.resource2, but
not to self.  All the workarounds I can come up with make __del__ look
pretty good, from a maintenance standpoint.
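
To make the pitfall concrete, here is a sketch (the Cleaner class fleshes
out the quoted helper; everything else is hypothetical):

import weakref

class Cleaner(object):
    def __init__(self):
        self.refs = set()         # keeps the weakrefs and callbacks alive
    def __call__(self, obj, func, *args, **kw):
        def cleanup(ref):
            self.refs.remove(ref)
            func(*args, **kw)
        self.refs.add(weakref.ref(obj, cleanup))

cleaner = Cleaner()

class _Res(object):
    def flush(self): pass
    def release(self): pass

class Broken(object):
    def __init__(self):
        self.res = _Res()
        # BUG: self.close is a bound method, so cleaner.refs now reaches
        # self through the callback and the object can never be collected.
        cleaner(self, self.close)
    def close(self):
        self.res.flush()
        self.res.release()

class Works(object):
    def __init__(self):
        self.res = _Res()
        # Capture the resource, never self -- but note how cleanup() is
        # an awkward near-copy of close(), which is the complaint above.
        def cleanup(res=self.res):
            res.flush()
            res.release()
        cleaner(self, cleanup)
    def close(self):
        self.res.flush()
        self.res.release()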

-jJ

From greg.ewing at canterbury.ac.nz  Sun Sep 24 03:12:05 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Sun, 24 Sep 2006 13:12:05 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <035f01c6def0$900594c0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<035f01c6def0$900594c0$4bbd2997@bagio>
Message-ID: <4515DB65.2090505@canterbury.ac.nz>

Giovanni Bajo wrote:

> What I am basically against is the need of removing an easy syntax which can
> have problematic side effects if you are not adult enough,

So what are you saying, that people who aren't adult enough
should be given a tool that's nice and easy but leads them
to write buggy code? That doesn't seem like a responsible
thing to do.

> complicated library workaround which requires deeper knowledge of Python
> (weakrefs, lambdas, early binding of default arguments, just to name three),

I don't see why it needs to be anywhere near that complicated.
All use of weakrefs can be hidden behind a call such as

   register_finalizer(self, func, *args, **kwds)

and we just need to say that func should be a plain function,
not a bound method of self, and self shouldn't appear anywhere
in the arguments.

Anyone who's not intelligent enough to understand and follow
those guidelines is not intelligent enough to avoid the
pitfalls of using __del__ either, IMO.
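
One way such a call could be implemented, as a sketch (not an existing
API):

import weakref

_live = []      # module-level list keeps the weakrefs and callbacks alive

def register_finalizer(obj, func, *args, **kwds):
    # Per the guidelines above, func must be a plain function that does
    # not reference obj; otherwise obj becomes effectively immortal.
    def callback(ref):
        _live.remove(ref)
        func(*args, **kwds)
    _live.append(weakref.ref(obj, callback))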

--
Greg

From tanzer at swing.co.at  Sun Sep 24 14:04:54 2006
From: tanzer at swing.co.at (Christian Tanzer)
Date: Sun, 24 Sep 2006 14:04:54 +0200
Subject: [Python-3000] Removing __var
In-Reply-To: Your message of "Fri, 22 Sep 2006 11:31:16 PDT."
	<6a36e7290609221131v55d52401s59a74e55b8575376@mail.gmail.com>
Message-ID: <E1GRSj8-0006VK-K5@swing.co.at>


"Bob Ippolito" <bob at redivi.com> wrote:

> On 9/22/06, Thomas Heller <theller at python.net> wrote:
> > Bob Ippolito schrieb:
> > > On 9/22/06, Fred L. Drake, Jr. <fdrake at acm.org> wrote:
> > >> On Friday 22 September 2006 13:05, Christian Tanzer wrote:
> > >>  > It is useful in some situations, though. In particular, I use a
> > >>  > metaclass that sets `__super` to the right value. This wouldn't work
> > >>  > without name mangling.
> > >>
> > >> This also doesn't work if two classes in the inheritance hierarchy have the
> > >> same __name__, if I understand how you're using this.  My guess is that
> > >> you're using calls like
> > >>
> > >>     def doSomething(self, arg):
> > >>         self.__super.doSomething(arg + 1)
> > >
> > > In the one or two situations where it "is useful" you could always
> > > write out what it would've done.
> > >
> > > self._ThisClass__super.doSomething(arg + 1)
> >
> > It is much more verbose, though.  The question is are you writing
> > this more often, or are you introspecting more often?
>
> The point is that legitimate __ usage is supposedly so rare that this
> verbosity doesn't matter. If it's verbose, people definitely won't use
> it until they need to, where right now people do it all the time cause
> it's "private".

How can you say that?

I don't use __ for `private`, I use it for making cooperative super
calls (and `__super` occurs 1397 times in my sandbox). I definitely don't
*want* to put the name of the class into a cooperative call. Compare

    self.__super.doSomething(arg + 1)

with

    super(SomeClass, self).doSomething(arg + 1)

The literal class name is verbose, error prone, and hostile to
refactoring.
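
The metaclass trick can be sketched roughly like this (a variant of the
well-known "autosuper" recipe; the metaclass name is mine, and as the
quoted text notes, it breaks if two classes in the hierarchy share a
__name__):

class AutoSuper(type):
    def __init__(cls, name, bases, d):
        super(AutoSuper, cls).__init__(name, bases, d)
        # Store an unbound super object under the mangled name, so
        # self.__super inside the class body resolves to it.
        setattr(cls, '_%s__super' % name, super(cls))

class A(object):
    __metaclass__ = AutoSuper
    def doSomething(self, arg):
        return arg

class B(A):                       # inherits the metaclass from A
    def doSomething(self, arg):
        # No literal class name: mangling turns this into _B__super.
        return self.__super.doSomething(arg + 1)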

I don't care about people supposedly abusing __ to define `private`
attributes -- we are all consenting adults here. (And people trying to
restrict visibility probably commit all sorts of blunders. Trying to
stop that might mean taking away most of Python's features).

-- 
Christian Tanzer                                    http://www.c-tanzer.at/


From fredrik at pythonware.com  Sun Sep 24 14:42:55 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Sun, 24 Sep 2006 14:42:55 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45158830.8020908@v.loewis.de>
References: <4511E644.2030306@blueyonder.co.uk>
	<451523DC.2050901@v.loewis.de>	<20060923104310.0863.JCARLSON@uci.edu>
	<45158830.8020908@v.loewis.de>
Message-ID: <ef5ugf$d9a$1@sea.gmane.org>

Martin v. Löwis wrote:

> I don't think reducing memory consumption is that important, for current
> hardware. Java and .NET have demonstrated that you can do "real"
> applications with that approach.

I've spent more time optimizing Python's string types than most, and 
that doesn't match my experiences at all.  Operations on wide chars are 
often faster than one might think, but any processor can copy X bytes of 
data faster than it can copy X*4 bytes of data, and I doubt that's going 
to change soon.

> I think supporting multiple representations at run-time would really
> be terrible. Any API of the "give me the data" kind would either have
> to expose the choice of representations, or perform a copy.

Unless you can guarantee that *all* external APIs that a Python 
extension might want to use will use exactly the same internal 
representation as Python, that's something that we have to deal with anyway.

> Either alternative would produce many programming errors in extension
 > modules.

And even if that was true (which I don't believe), "many" would still
be "very small" compared to the problems that reference counting and 
error handling are causing.

</F>


From martin at v.loewis.de  Sun Sep 24 18:31:12 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 24 Sep 2006 18:31:12 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ef5ugf$d9a$1@sea.gmane.org>
References: <4511E644.2030306@blueyonder.co.uk>	<451523DC.2050901@v.loewis.de>	<20060923104310.0863.JCARLSON@uci.edu>	<45158830.8020908@v.loewis.de>
	<ef5ugf$d9a$1@sea.gmane.org>
Message-ID: <4516B2D0.9020109@v.loewis.de>

Fredrik Lundh schrieb:
>> I don't think reducing memory consumption is that important, for current
>> hardware. Java and .NET have demonstrated that you can do "real"
>> applications with that approach.
> 
> I've spent more time optimizing Python's string types than most, and 
> that doesn't match my experiences at all.  Operations on wide chars are 
> often faster than one might think, but any processor can copy X bytes of 
> data faster than it can copy X*4 bytes of data, and I doubt that's going 
> to change soon.

These statements don't contradict each other. You are saying that there is a
measurable, perhaps significant difference between copying single-byte
vs. double-byte strings. I can believe this.

My claim is that this still isn't that important, and that it will
be "fast enough", anyway. In many cases, the application will be
IO-bound, so the cost of string operations might be negligible,
either way.

Of course, both statements generalize across an unspecified set of
applications, so it is a matter of personal preferences.

>> I think supporting multiple representations at run-time would really
>> be terrible. Any API of the "give me the data" kind would either have
>> to expose the choice of representations, or perform a copy.
> 
> Unless you can guarantee that *all* external APIs that a Python 
> extension might want to use will use exactly the same internal 
> representation as Python, that's something that we have to deal with anyway.

APIs will certainly allow different kinds of memory buffers to
create a Python string object. Creation is a fairly small part
of the API; I believe it would noticeably simplify the
implementation if there is only a single internal representation.


>> Either alternative would produce many programming errors in extension
>  > modules.
> 
> And even if that was true (which I don't believe), "many" would still
> be "very small" compared to the problems that reference counting and 
> error handling are causing.

We will see. We need a specification or implementation first to see,
of course.

Regards,
Martin

From talin at acm.org  Sun Sep 24 20:55:47 2006
From: talin at acm.org (Talin)
Date: Sun, 24 Sep 2006 11:55:47 -0700
Subject: [Python-3000] Transitional GC?
Message-ID: <4516D4B3.50905@acm.org>

I wonder if there is a way to create an API for extension modules that 
would allow a gradual phase-out of reference counting, towards a 'pure' GC.

(Let's leave aside the merits of reference counting vs. non-reference 
counting for another thread - please.)

Most of the discussion up to this point has assumed that there's a sharp 
line between the two GC schemes - in other words, once you switch over, 
you have to migrate every extension module all at once.

I've been wondering, however, if there isn't some way for both schemes 
to coexist within the same interpreter, for some transitional period. 
You would have some modules that use the RC API, while other modules 
would use the 'tracing' API. Modules could gradually be ported to the 
new API until there were none left, at which point you could throw the 
switch and remove the RC support entirely.

I'm assuming two things here:
   1) That such a transitional scheme would have to be as efficient (or 
nearly so) as the existing scheme in terms of memory and speed.
   2) That we're talking source-level compatibility only - there's no 
expectation that you would be able to link with modules compiled under 
the old API.

I see two basic approaches to this. The first is to have 
reference-counting modules live in a predominantly trace-based world; 
the other is to allow tracing modules to live in a predominantly 
reference-counted world.

The first approach is relatively straightforward - you simply add any 
object with a non-zero refcount to the root set. Objects whose refcounts 
fall to zero are not immediately deleted, but instead get placed into 
the youngest generation to be traced and collected.

The second approach requires that an object be able to manage refcounts 
via its trace function.

Consider what an extension module looks like under a tracing regime. 
Each extension class is required to provide a 'trace' function that 
iterates through all references held by an object.

The 'trace' function need not know the purpose of the trace - in other 
words, it need not know *why* the references are being iterated; its 
only concern is to provide access to each reference. This is most 
easily accomplished by passing a callback function to the trace 
function. The trace function iterates through the object's references 
and calls the callback once for each one.

Because the extension module doesn't know why the references are being 
traced, we have the freedom to redefine what a 'trace' means at 
various points in the transition.
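
In Python terms, the pattern looks something like this (a sketch only;
real extension types would implement the trace function in C, and all
names here are hypothetical):

class Node(object):
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right
    def trace(self, visit):
        # Report each held reference; we neither know nor care why.
        for ref in (self.left, self.right):
            if ref is not None:
                visit(ref)

# The same trace function can serve two different purposes:
reachable = set()
def mark(obj):                    # marking pass of a tracing collector
    if id(obj) not in reachable:
        reachable.add(id(obj))
        obj.trace(mark)

dropped = []
def drop(obj):                    # teardown pass: release each reference
    dropped.append(obj)

root = Node(Node(), Node(Node()))
mark(root)                        # trace transitively
root.trace(drop)                  # same trace, different meaning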

So one scheme would allow a 'traced' object to exist in a 
reference-counted world by using the trace function to release 
references. When an object is destroyed, the trace function is called, 
and the callback releases each reference.

Dealing with mutation of references is trickier - there are a couple of 
approaches I've thought of, but none are particularly efficient. I guess 
the traced object will have to call the old refcounting functions, but 
via macros which can be no-op'd later.

-- Talin


From rasky at develer.com  Sun Sep 24 21:50:01 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Sun, 24 Sep 2006 21:50:01 +0200
Subject: [Python-3000] Removing __var
References: <6a36e7290609221131v55d52401s59a74e55b8575376@mail.gmail.com>
	<E1GRSj8-0006VK-K5@swing.co.at>
Message-ID: <0d2101c6e012$9f88be90$4bbd2997@bagio>

Christian Tanzer wrote:

> I don't use __ for `private`, I use it for making cooperative super
> calls (and `__super` occurs 1397 times in my sandbox).

I think you might be mistaking the symptom for the disease. To me, your mail
means that Py3k should grow some syntactic sugar for super calls. I guess if
that happens, you won't be missing __.

Giovanni Bajo


From martin at v.loewis.de  Sun Sep 24 22:00:35 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sun, 24 Sep 2006 22:00:35 +0200
Subject: [Python-3000] Transitional GC?
In-Reply-To: <4516D4B3.50905@acm.org>
References: <4516D4B3.50905@acm.org>
Message-ID: <4516E3E3.6000005@v.loewis.de>

Talin schrieb:
> I wonder if there is a way to create an API for extension modules that 
> would allow a gradual phase-out of reference counting, towards a 'pure' GC.
> 
> (Let's leave aside the merits of reference counting vs. non-reference 
> counting for another thread - please.)
> 
> Most of the discussion up to this point has assumed that there's a sharp 
> line between the two GC schemes - in other words, once you switch over, 
> you have to migrate every extension module all at once.

I think this is a minor issue. Your approach assumes that moving to
a tracing GC will require module authors to change their code. Perhaps
that isn't necessary. It is difficult to tell, in the abstract, whether
your proposal works or not.

Regards,
Martin

From jcarlson at uci.edu  Sun Sep 24 23:45:36 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 24 Sep 2006 14:45:36 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45158830.8020908@v.loewis.de>
References: <20060923104310.0863.JCARLSON@uci.edu>
	<45158830.8020908@v.loewis.de>
Message-ID: <20060924144006.086D.JCARLSON@uci.edu>


"Martin v. L?wis" <martin at v.loewis.de> wrote:
> Josiah Carlson schrieb:
> > For me, having recently remembered what was in a unicode string, and
> > verifying it by checking the source, the question in my mind is whether
> > we want to stick with the same 2-representation implementation (default
> > encoding and UTF-16 or UCS-4 depending on build), or go with more or
> > fewer representations.
> 
> I would personally like to see a Python API that operates on code
> points, with support for 17 planes. I also think that efficient indexing
> is important.

Fully-featured unicode would be nice.


> There are trade-offs, of course. I personally think the best trade-off
> would be to have a two-byte representation, along with a flag telling
> whether there are any surrogate pairs in the string. Indexing and
> length would be constant-time if there are no surrogates, and linear
> time if there are.

What about a tree structure over the top of the string as I described in
another post?  If there are no surrogate pairs, the pointer to the tree
is null.  If there are surrogate pairs, we could either use the
structure as I described, or even modify it so that we get even better
memory utilization/performance (choose tree nodes based on where
surrogate pairs are, up to some limit).
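
For illustration, here is a flattened variant of that idea as a sketch
(a sorted array of pair positions instead of a tree, in pure Python,
assuming a list of UTF-16 code units):

def pair_starts(units):
    # unit positions of high surrogates (each starts a 2-unit pair)
    return [i for i, u in enumerate(units) if 0xD800 <= u <= 0xDBFF]

def unit_index(starts, cp):
    # Map a code-point index to a code-unit index in O(log n): count
    # the pairs lying before cp (the k-th pair sits at code point
    # starts[k] - k, since each earlier pair used 2 units for 1 point).
    lo, hi = 0, len(starts)
    while lo < hi:
        mid = (lo + hi) // 2
        if starts[mid] - mid < cp:
            lo = mid + 1
        else:
            hi = mid
    return cp + lo

units = [0x41, 0xD835, 0xDD04, 0x42]    # 'A', one non-BMP char, 'B'
starts = pair_starts(units)             # empty list -> plain O(1) indexing
assert unit_index(starts, 2) == 3       # 'B' is the third code point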

 - Josiah


From jcarlson at uci.edu  Sun Sep 24 23:54:21 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 24 Sep 2006 14:54:21 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ef5ugf$d9a$1@sea.gmane.org>
References: <45158830.8020908@v.loewis.de> <ef5ugf$d9a$1@sea.gmane.org>
Message-ID: <20060924144621.0870.JCARLSON@uci.edu>


Fredrik Lundh <fredrik at pythonware.com> wrote:
> Martin v. Löwis wrote:
> > I think supporting multiple representations at run-time would really
> > be terrible. Any API of the "give me the data" kind would either have
> > to expose the choice of representations, or perform a copy.
> 
> Unless you can guarantee that *all* external APIs that a Python 
> extension might want to use will use exactly the same internal 
> representation as Python, that's something that we have to deal with anyway.

I think Martin meant with regards to, for example, choosing an internal
Latin-1, UCS-2, or UCS-4 representation based on the code points of the
string.

I stated earlier that with a buffer interface that returned the *size*
of elements, users could program based on internal representation, but I
agree that it would be error prone.


What if we just chose UTF-16 as an internal representation?  No
default system encoding version attached (as there is right now). Extension
writers could write for the single representation, and convert if it
isn't what they want (and where is the default system encoding ever what
is desired?)


 - Josiah


From gabor at nekomancer.net  Mon Sep 25 01:48:29 2006
From: gabor at nekomancer.net (gabor)
Date: Mon, 25 Sep 2006 01:48:29 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45152234.1090303@v.loewis.de>
References: <20060920154655.mrv1x9ksb24g8844@login.werra.lunarpages.com>	<ca471dc20609201552g54d03f4es736e56ebf94bffb0@mail.gmail.com>
	<45126E76.9020600@nekomancer.net> <45152234.1090303@v.loewis.de>
Message-ID: <4517194D.1030908@nekomancer.net>

Martin v. Löwis wrote:
> Gábor Farkas schrieb:
>> while i understand the constraints, i think it's not a good decision to 
>> leave this to be implementation-dependent.
>>
>> the strings seem to me as such a basic functionality, that it's 
>> behaviour should not depend on the platform.
>>
>> for example, how is an application developer then supposed to write 
>> their applications?
> 
> An application developer should always know what the target platforms
> are. For example, does the code need to work with IronPython or not?


i think if IronPython claims to be a python implementation, then at 
least a simple hello-world style string manipulation program should 
behave the same way on IronPython and on CPython.

(of course when it's a 'bigger' program that uses some python libraries, 
then yes, he should know. but we are talking about a builtin type here)

> Python is not aiming at 100% portability at all costs. Many aspects
> are platform dependent, and while this has complicated some
> applications, is has simplified others (which could make use of
> platform details that otherwise would not have been exposed to the
> Python programmer).

hmmm.. i thought that all those 'platform dependent' aspects are in the 
libraries (win32/sys/posix/os/whatever), and not in the "core" part.

so, are there any in the "core" (stupid naming i know. i mean 
not-in-libraries) part?


> 
>> should he write his own slicing/whatever functions to get consistent 
>> behaviour on linux/windows?
> 
> Depends on the application, and the specific slicing operations.
> If the slicing appears in the processing of .ini files (say),
> no platform-dependent slicing should be necessary.

why?

or do you simply assume that an ini file cannot contain non-bmp unicode 
characters?

but if you'd like to have an example then:

let's say in an application i only want to display the first 70 
characters of a string.

now, for this to behave correctly on non-bmp characters, i will need to 
write a custom function, correct?

> 
>> but the same way i could say, that because most of the unix-world is 
>> utf-8, for those pythons the best way is to handle it internally as 
>> utf-8, couldn't i?
> 
> I think you live in a free country: you can certainly say that
> I think you would be wrong. The common on-disk/on-wire representation
> of text should not influence the design of an in-memory representation.

sorry, i should have clarified this more.

i simply reacted to the situation that for example cpython-win32 and 
IronPython use 16bit unicode-strings, which makes it easy for them to 
communicate with the (afaik) mostly 16bit-unicode win32 API.

on the other hand, for example GTK uses utf8-encoded strings...so when 
on linux the python-GTK bindings want to transfer strings, they will 
have to do charset-conversion.

but this was only an example.

> 
>> it simply seems to me strange to make compromises that makes the life of 
>> the cpython-users harder, just to make the life for the 
>> jython/ironpython developers (i mean the 'creators') easier.
> 
> Guido didn't say that the life of the CPython user needs to be hard.

hmmm.. for me having to worry about string-handling differences in the 
programming language i use qualifies as 'harder'.

> He said it will be implementation-dependent, referring to Jython
> and IronPython.
> Whether or not CPython uses a consistent representation
> or consistent python-level experience across platforms is a different
> issue. CPython could behave absolutely consistently, and use four-byte
> Unicode on all systems, and the length of a non-BMP string would
> still be implementation-defined.
> 


i understand that difference.

(i just find it hard to believe that string-handling does not seem 
important enough to make it truly cross-platform (or cross-implementation))

gabor

From jcarlson at uci.edu  Mon Sep 25 06:34:12 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Sun, 24 Sep 2006 21:34:12 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4517194D.1030908@nekomancer.net>
References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net>
Message-ID: <20060924210217.0873.JCARLSON@uci.edu>


gabor <gabor at nekomancer.net> wrote:
> Martin v. Löwis wrote:
> > Gábor Farkas schrieb:
[snip]
> > Python is not aiming at 100% portability at all costs. Many aspects
> > are platform dependent, and while this has complicated some
> > applications, it has simplified others (which could make use of
> > platform details that otherwise would not have been exposed to the
> > Python programmer).
> 
> hmmm.. i thought that all those 'platform dependent' aspects are in the 
> libraries (win32/sys/posix/os/whatever), and not in the "core" part.
> 
> so, are there any in the "core" (stupid naming i know. i mean 
> not-in-libraries) part?

import sys
sys.setrecursionlimit(10000)

def foo():
    foo()

foo()

Run that in Windows, and you get a MemoryError.  Run it in Linux, and
you get a segfault.  Blame linux malloc.


> >> should he write his own slicing/whatever functions to get consistent 
> >> behaviour on linux/windows?
> > 
> > Depends on the application, and the specific slicing operations.
> > If the slicing appears in the processing of .ini files (say),
> > no platform-dependent slicing should be necessary.
[snip]
> let's say in an application i only want to display the first 70 
> characters of a string.
> 
> now, for this to behave correctly on non-bmp characters, i will need to 
> write a custom function, correct?

That depends on what you mean by "now," and on the Python compile option.
If you mean that "today ... i would need to write a custom function",
then you would be correct on a utf-16 compiled Python for all characters
with a code point > 65535, but not so on a ucs-4 build (but perhaps both
when there are surrogate pairs). In the future, the plan, I believe, is
to attempt to make utf-16 behave like ucs-4 eith regards to all
operations available from Python, at least for all characters
represented with a single code point.
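
The kind of custom function gabor mentions would, on a utf-16 build,
look something like this sketch (pure Python; it avoids splitting a
surrogate pair when taking the first n characters):

def first_n(u, n):
    count = i = 0
    while i < len(u) and count < n:
        if 0xD800 <= ord(u[i]) <= 0xDBFF and i + 1 < len(u):
            i += 2          # surrogate pair: one character, two units
        else:
            i += 1
        count += 1
    return u[:i]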


> >> but the same way i could say, that because most of the unix-world is 
> >> utf-8, for those pythons the best way is to handle it internally as 
> >> utf-8, couldn't i?
> > 
> > I think you live in a free country: you can certainly say that
> > I think you would be wrong. The common on-disk/on-wire representation
> > of text should not influence the design of an in-memory representation.
> 
> sorry, i should have clarified this more.
> 
> i simply reacted to the situation that for example cpython-win32 and 
> IronPython use 16bit unicode-strings, which makes it easy for them to 
> communicate with the (afaik) mostly 16bit-unicode win32 API.
> 
> on the other hand, for example GTK uses utf8-encoded strings...so when 
> on linux the python-GTK bindings want to transfer strings, they will 
> have to do charset-conversion.
> 
> but this was only an example.

The current CPython implementation keeps two representations of unicode
strings in memory: the utf-16 or ucs-4 representation (depending on
compile-time options) and a default system encoding representation.  If
you set your default system encoding to be utf-8, Python doesn't need to
do anything more to hand unicode strings off to GTK, aside from
recognizing that it has what it wants already.


[snip]
> hmmm.. for me having to worry about string-handling differences in the 
> programming language i use qualifies as 'harder'.

With what Martin and Fredrik have been saying recently, I don't believe
that you have anything significant to worry about when it comes to
string behavior on CPython vs. IronPython, Jython, or even PyPy.


> > He said it will be implementation-dependent, referring to Jython
> > and IronPython.
> > Whether or not CPython uses a consistent representation
> > or consistent python-level experience across platforms is a different
> > issue. CPython could behave absolutely consistently, and use four-byte
> > Unicode on all systems, and the length of a non-BMP string would
> > still be implementation-defined.
> 
> i understand that difference.
> 
> (i just find it hard to believe, that string-handling does not seem 
> important enough to make it truly cross-platform (or cross-implementation))

It is important, arguably one of the most important pieces.  But there
are three parts: 1) code points not currently defined within the unicode
spec, but which have specific encodings (based on the code point value), 2)
in the case of UTF-16 representations, Python's handling of characters >
65535, 3) surrogates.

I believe #1 is handled "correctly" today, Martin sounds like he wants
#2 fixed for Py3k (I don't believe anyone *doesn't* want it fixed), and
#3 could be fixed while fixing #2 with a little more work (if desired).


 - Josiah


From martin at v.loewis.de  Mon Sep 25 07:26:30 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Mon, 25 Sep 2006 07:26:30 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060924144006.086D.JCARLSON@uci.edu>
References: <20060923104310.0863.JCARLSON@uci.edu>
	<45158830.8020908@v.loewis.de>
	<20060924144006.086D.JCARLSON@uci.edu>
Message-ID: <45176886.2090201@v.loewis.de>

Josiah Carlson schrieb:
> What about a tree structure over the top of the string as I described in
> another post?  If there are no surrogate pairs, the pointer to the tree
> is null.  If there are surrogate pairs, we could either use the
> structure as I described, or even modify it so that we get even better
> memory utilization/performance (choose tree nodes based on where
> surrogate pairs are, up to some limit).

As always, it's a time-vs-space tradeoff. People tend to resolve these
in favor of time, accepting an increase in space. I'm not so sure this
is always the right answer. In the specific case, I'm also worried about
the increase in complexity.

That said, it is always good to have a prototype implementation to
analyse the consequences better.

Regards,
Martin

From qrczak at knm.org.pl  Mon Sep 25 11:57:10 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Mon, 25 Sep 2006 11:57:10 +0200
Subject: [Python-3000] Transitional GC?
In-Reply-To: <4516D4B3.50905@acm.org> (talin@acm.org's message of "Sun, 24
	Sep 2006 11:55:47 -0700")
References: <4516D4B3.50905@acm.org>
Message-ID: <87psdkfevd.fsf@qrnik.zagroda>

Talin <talin at acm.org> writes:

> I wonder if there is a way to create an API for extension modules that 
> would allow a gradual phase-out of reference counting, towards a 'pure' GC.

I believe this is possible when C code doesn't access addresses of
Python objects directly, but via handles.

http://srfi.schemers.org/srfi-50/mail-archive/msg00295.html

See "Minor" link there, and the whole SRFI-50 discussion about
FFI styles.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Mon Sep 25 13:02:14 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Mon, 25 Sep 2006 13:02:14 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <4515DB65.2090505@canterbury.ac.nz> (Greg Ewing's message of
	"Sun, 24 Sep 2006 13:12:05 +1200")
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<035f01c6def0$900594c0$4bbd2997@bagio>
	<4515DB65.2090505@canterbury.ac.nz>
Message-ID: <87hcywgqfd.fsf@qrnik.zagroda>

Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

> All use of weakrefs can be hidden behind a call such as
>
>    register_finalizer(self, func, *args, **kwds)

It should be possible to finalize the object explicitly, given a
handle returned by this function, and possibly to kill the finalizer
without executing it.

The former is useful to implement close(). The latter is useful for
weak dictionaries: when an entry is removed because it's overwritten,
there is no need to keep a finalizer which will remove the old entry
when the key dies.

IMHO a weak reference can conveniently play the role of such handle.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From jimjjewett at gmail.com  Mon Sep 25 16:33:26 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Mon, 25 Sep 2006 10:33:26 -0400
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060924210217.0873.JCARLSON@uci.edu>
References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net>
	<20060924210217.0873.JCARLSON@uci.edu>
Message-ID: <fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>

On 9/25/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> gabor <gabor at nekomancer.net> wrote:
> > Martin v. Löwis wrote:
> > > Gábor Farkas schrieb:

> > >> should he write his own slicing/whatever functions to get consistent
> > >> behaviour on linux/windows?

> > now, for this to behave correctly on non-bmp characters, i will need to
> > write a custom function, correct?

As David Hopwood pointed out, to be fully correct, you already have to
create a custom function even with bmp characters, because of
decomposed characters.  (Example:  Representing a c-cedilla as a c and
a combining cedilla, rather than as a single code point.)  Separating
those two would be wrong.  Counting them as two characters for slicing
purposes would usually be wrong.

Even 32-bit representations are permitted to use surrogate pairs; it
just doesn't often make sense.

These are problems inherent to unicode (or at least to non-normalized
unicode).  Different python implementations may expose the problem in
different places, but the problem is always there.

We *could* specify that slicing and indexing act as though the
underlying representation were normalized (and this would typically
require normalization as part of construction), but I'm not sure that
is the right answer.  Even if it were trivial, there are reasons not
to normalize.

> It is important, arguably one of the most important pieces.  But there
> are three parts; 1) code points not currently defined within the unicode
> spec, but who have specific encodings (based on the code point value), 2)
> in the case of UTF-16 representations, Python's handling of characters >
> 65535, 3) surrogates.

> I believe #1 is handled "correctly" today, Martin sounds like he wants
> #2 fixed for Py3k (I don't believe anyone *doesn't* want it fixed), and
> #3 could be fixed while fixing #2 with a little more work (if desired).

You also left out (4), decomposed characters, which is a more complex
version of surrogates.

Guido just stated that #2 is intentional,  though he didn't pronounce
that it should stay that way.  There are sound arguments both ways.
In particular, fixing it without fixing decomposed characters might
incur the cost without the benefit.

-jJ

From paul at prescod.net  Mon Sep 25 17:50:16 2006
From: paul at prescod.net (Paul Prescod)
Date: Mon, 25 Sep 2006 08:50:16 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>
References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net>
	<20060924210217.0873.JCARLSON@uci.edu>
	<fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>
Message-ID: <1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>

On 9/25/06, Jim Jewett <jimjjewett at gmail.com> wrote:
>
> As David Hopwood pointed out, to be fully correct, you already have to
> create a custom function even with bmp characters, because of
> decomposed characters.  (Example:  Representing a c-cedilla as a c and
> a combining cedilla, rather than as a single code point.)  Separating
> those two would be wrong.  Counting them as two characters for slicing
> purposes would usually be wrong.


> Even 32-bit representations are permitted to use surrogate pairs; it
> just doesn't often make sense.

There is at least one big difference between surrogate pairs and decomposed
characters. The user can typically normalize away decompositions. How do you
normalize away surrogate pairs in a language that only supports 16-bit
representations?

Paul Prescod

From fredrik at pythonware.com  Mon Sep 25 18:01:21 2006
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Mon, 25 Sep 2006 18:01:21 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <4516B2D0.9020109@v.loewis.de>
References: <4511E644.2030306@blueyonder.co.uk>	<451523DC.2050901@v.loewis.de>	<20060923104310.0863.JCARLSON@uci.edu>	<45158830.8020908@v.loewis.de>	<ef5ugf$d9a$1@sea.gmane.org>
	<4516B2D0.9020109@v.loewis.de>
Message-ID: <ef8ugh$t6o$1@sea.gmane.org>

Martin v. Löwis wrote:

>>> I think supporting multiple representations at run-time would really
>>> be terrible. Any API of the "give me the data" kind would either have
>>> to expose the choice of representations, or perform a copy.
 >>
>> Unless you can guarantee that *all* external APIs that a Python 
>> extension might want to use will use exactly the same internal 
>> representation as Python, that's something that we have to deal with anyway.
> 
> APIs will certainly allow different kinds of memory buffers to
> create a Python string object. Creation is a fairly small part
> of the API

creation is not the problem; it's the "give me the data" API that's the 
problem.  or rather, the "give me the data in a form that's compatible 
with the 3rd party API that I'm about to call" API.

> I believe it would noticeably simplify the implementation if there is
 > only a single internal representation.

and I, wearing my string algorithm implementor hat, tend to disagree 
with that.  writing source code that can be compiled into efficient code 
for multiple representations is mostly trivial, even in C.

</F>


From david.nospam.hopwood at blueyonder.co.uk  Tue Sep 26 01:19:54 2006
From: david.nospam.hopwood at blueyonder.co.uk (David Hopwood)
Date: Tue, 26 Sep 2006 00:19:54 +0100
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
References: <45152234.1090303@v.loewis.de>
	<4517194D.1030908@nekomancer.net>	<20060924210217.0873.JCARLSON@uci.edu>	<fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>
	<1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
Message-ID: <4518641A.4070500@blueyonder.co.uk>

Paul Prescod wrote:
> On 9/25/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> 
>> As David Hopwood pointed out, to be fully correct, you already have to
>> create a custom function even with bmp characters, because of
>> decomposed characters.  (Example:  Representing a c-cedilla as a c and
>> a combining cedilla, rather than as a single code point.)  Separating
>> those two would be wrong.  Counting them as two characters for slicing
>> purposes would usually be wrong.
> 
> Even 32-bit representations are permitted to use surrogate pairs; it
> just doesn't often make sense.
> 
> There is at least one big difference between surrogate pairs and decomposed
> characters. The user can typically normalize away decompositions.

That depends what script they're using. For some scripts, they can't.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



From rhettinger at ewtllc.com  Tue Sep 26 01:41:51 2006
From: rhettinger at ewtllc.com (Raymond Hettinger)
Date: Mon, 25 Sep 2006 16:41:51 -0700
Subject: [Python-3000] Removing __del__
In-Reply-To: <03bb01c6def4$257b6c70$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
Message-ID: <4518693F.1050500@ewtllc.com>


>>I've never seen an API that works like that. Have you?
>
>The class above shows a case where:
>
>1) There's a way to destruct the handle BEFORE __del__ is called, which would
>require killing the weakref / deregistering the finalization hook. I believe
>you agree that this is pretty common (I've around 10 usages of this pattern,
>__del__ with a separate explicit closure method, in one Python base-code of
>mine).
>
ISTM, you've adopted __del__ as your best friend, learned to avoid its 
pitfalls, employed it throughout your code, and forsaken weakref-based 
approaches, which is understandable because weakrefs came along rather 
late in the game. I congratulate you on that level of accomplishment.

I support the original suggestion to remove __del__ because I think that 
most programmers would be better off without it, that weakref-based 
alternatives are possible (though not necessarily easier or more 
succinct), and that explicit finalization is preferable to implicit 
(i.e. there's a reason for the advice to wrap file access in a try/finally 
to make sure an explicit close() occurs). 

In a world dominated by new-style classes, it is a strong plus that 
weakrefs reliably avoid creating cycles which subtly block or delay 
finalization.  Eliminating __del__ will also mean an end to 
implementation headaches relating to issues stemming from arbitrary 
finalization code running while an object is still alive. The __del__ 
special method has long been a dark corner of Python, a rarely used and 
error-prone tool.  Just having it around creates a suggestion that it 
would be a good idea to design code relying on implicit finalization and 
the fragile hope that you or some future maintainer doesn't accidentally 
keep a reference to an object you had intended to vanish of its own accord.

In short, __del__ should disappear not because it is useless but because 
it is hazardous.  The consenting adults philosophy means that we don't 
put-up artificial barriers to intentional hacks, but it does not mean 
that we bait the hook and leave error-prone traps for the unwary.  In 
Py3k, I would like to see explicit finalization as a preferred approach 
and for weakrefs be the one-way-to-do-it for designs with implicit 
finalization.


Raymond

From rasky at develer.com  Tue Sep 26 10:59:33 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 26 Sep 2006 10:59:33 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
	<4518693F.1050500@ewtllc.com>
Message-ID: <120f01c6e14a$15edbfd0$4bbd2997@bagio>

Raymond Hettinger wrote:

> In short, __del__ should disappear not because it is useless but
> because
> it is hazardous.  The consenting adults philosophy means that we don't
> put-up artificial barriers to intentional hacks, but it does not mean
> that we bait the hook and leave error-prone traps for the unwary.  In
> Py3k, I would like to see explicit finalization as a preferred
> approach
> and for weakrefs be the one-way-to-do-it for designs with implicit
> finalization.

Raymond, there is one thing I don't understand in your line of reasoning. You
say that you prefer explicit finalization, but that implicit finalization still
needs to be supported. And for that, you'd rather drop __del__ and use
weakrefs. But why? You say that __del__ is hazardous, but I can't see how
weakrefs are less hazardous. As an implicit finalization method, they live on
the fragile assumption that the callback won't hold a reference to the object:
an assumption which cannot be enforced in any way but cautious programming and
scrupulous auditing of the code. I assert that they hide bugs much better than
__del__ does (it's pretty easy to find an offending __del__ by looking at
gc.garbage, while it's harder to notice a missing finalization because the
cycle loop involving the weakref callback was broken at the wrong point).

I guess there's something escaping me. If we have to drop one, why is it
__del__? And if __del__ could be fixed to reliably work in reference cycles,
would you still want to drop it?

Giovanni Bajo


From greg.ewing at canterbury.ac.nz  Tue Sep 26 11:57:13 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Tue, 26 Sep 2006 21:57:13 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <120f01c6e14a$15edbfd0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
Message-ID: <4518F979.50902@canterbury.ac.nz>

Giovanni Bajo wrote:
> I assert that they hide bugs much better than
> __del__ does (it's pretty easy to find an offending __del__ by looking at
> gc.garbage,

It should be feasible to modify the cyclic GC to
detect groups of objects that are only being kept
alive by references from the finalizer list. These
could be treated the same way as __del__-containing
cycles are now, and moved to a garbage list.

--
Greg

From tim.peters at gmail.com  Tue Sep 26 13:01:23 2006
From: tim.peters at gmail.com (Tim Peters)
Date: Tue, 26 Sep 2006 07:01:23 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <120f01c6e14a$15edbfd0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
Message-ID: <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>

[Giovanni Bajo]
> Raymond, there is one thing I don't understand in your line of reasoning. You
> say that you prefer explicit finalization, but that implicit finalization still
> needs to be supported. And for that, you'd rather drop __del__ and use
> weakrefs. But why? You say that __del__ is hazardous, but I can't see how
> weakrefs are less hazardous. As an implicit finalization method, they live on
> the fragile assumption that the callback won't hold a reference to the object:
> an assumption which cannot be enforced in any way but cautious programming and
> scrupulous auditing of the code.

Nope, not so.  Read Modules/gc_weakref.txt for the gory details.  In
outline, there are three objects of interest here:  the weakly
referenced object (WO), the weakref (WR) to the WO, and the callback
(CB) callable attached to the WR.

/Normally/ the CB is reachable (== not trash).  If a reachable CB has
a strong reference to the WO, then that keeps the WO reachable too,
and of course the CB won't be invoked so long as its strong reference
keeps the WO alive.  The CB can't become trash either so long as the
WR is reachable, since the WR holds a strong reference to the CB.  If
the WR becomes trash while the WO is reachable, the WR clears its
reference to the CB, and then the CB will never be invoked period.

OTOH, if the CB has a weak reference to the WO, then when the WO goes
away and the CB is invoked, the CB's weak reference returns None
instead of the WO.

So in no case can a reachable CB actually get at the WO via the CB's
own strong or weak reference to the WO.  More, this is true even if
the WO is just strongly reachable via any path /from/ a reachable CB:
the fact that the CB is reachable guarantees the WO is reachable then
too.

Skipping details, things get muddier only when all three of these
objects end up in cyclic trash (CT) "at the same time".  The dodge
Python currently takes is that, when a WR is part of CT, and the WR's
referent is also part of CT, the WR's CB (if any) is never invoked.
This is defensible since the order in which trash objects are
finalized isn't defined, so it's legitimate to kill the WR first.
It's unclear whether that's entirely desirable behavior, though.
There were excruciating discussions about this earlier, but nobody had
a concrete use case favoring a specific position.

From rasky at develer.com  Tue Sep 26 14:19:34 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 26 Sep 2006 14:19:34 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
Message-ID: <01d301c6e166$070bde90$e303030a@trilan>

Tim Peters wrote:

> [Giovanni Bajo]
>> Raymond, there is one thing I don't understand in your line of
>> reasoning. You say that you prefer explicit finalization, but that
>> implicit finalization still needs to be supported. And for that,
>> you'd rather drop __del__ and use weakrefs. But why? You say that
>> __del__ is hazardous, but I can't see how weakrefs are less
>> hazardous. As an implicit finalization method, they live on the
>> fragile assumption that the callback won't hold a reference to the
>> object: an assumption which cannot be enforced in any way but
>> cautious programming and scrupulous auditing of the code.
>
> Nope, not so.  Read Modules/gc_weakref.txt for the gory details.
> [...]
> The dodge
> Python currently takes is that, when a WR is part of CT, and the WR's
> referent is also part of CT, the WR's CB (if any) is never invoked.
> This is defensible since the order in which trash objects are
> finalized isn't defined, so it's legitimate to kill the WR first.
> It's unclear whether that's entirely desirable behavior, though.
> There were excruciating discussions about this earlier, but nobody had
> a concrete use case favoring a specific position.

Thanks for the explanation, and I believe you are confirming my position.
You are saying that the CB of a WR which is part of CT is never invoked. In
the above quote, I'm saying that if the user makes a mistake and writes a CB
(as an implicit finalizer) which holds a reference to the WO, they create a
CT, so the CB will never be invoked. For instance:

class Wrapper:
    def __init__(self, *args):
        self.handle = CAPI.init(*args)
        # BUG HERE: the callback references self
        self._wr = weakref.ref(self, lambda ref: CAPI.close(self.handle))

In this case, we have a CT: a Wrapper instance is the WO, which holds a
strong reference to the WR (self._wr), which holds a strong reference to the
CB (the lambda), which holds a strong reference to the WO again (through the
implicit usage of nested scopes). Thus, in this case, the CB will never be
called. Is that right? I have tried this variant to verify it myself:

>>> import weakref
>>> class Wrapper:
...     def __init__(self):
...             def test(ref):
...                     print "finalizer called", self.a
...             self.a = 1234
...             self._wr = weakref.ref(self, test)
...
>>> w = Wrapper()
>>> del w
>>> import gc
>>> gc.collect()
6
>>> gc.collect()
0
>>> gc.garbage
[]


Given these examples, I still can't see why weakrefs are thought to be a
preferable solution for implicit finalization, when compared to __del__.
They mostly share the same problems when it comes to cyclic trash, but
__del__ is far easier to teach, explain and understand. I can quickly teach
people not to use cycles with __del__, and I can verify whether there's a
mistake by looking at gc.garbage; teaching how to properly use weakrefs,
callbacks, and how to avoid reference loops with nested scopes is much harder
in the first place, and does not seem to provide any advantage.

===============================

Tim, I sort of hoped you would jump into this discussion. I had this link
around that I wanted to show you:
http://mail.python.org/pipermail/python-dev/2000-March/002526.html

I re-read most threads in those weeks about finalization issues with cyclic
trash. Guido was proposing a solution with __del__ and CT, which
approximately worked this way:

- When a CT is detected, any __del__ method is invoked once per instance, in
random order.
- We make sure that each __del__ method is called once and only once per
instance (by using some sort of flag; Guido was proposing to set
self.__dict__["__del__"] = None, but that predates new-style classes as far
as I can tell).
- After all __del__ methods in the CT have been called exactly once, we
collect the trash as usual (break links by reclaiming the __dict__ of the
instances, or whatever); a rough sketch follows.
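
As a toy model of that rule (this is not CPython's collector, and
__close__ is the proposed hook):

def collect_cycle(trash):
    # First pass: run each finalizer exactly once, in arbitrary order,
    # while every object in the cycle is still fully intact.
    for obj in trash:
        if not getattr(obj, '_finalized', False):
            obj._finalized = True           # the once-only flag
            close = getattr(obj, '__close__', None)
            if close is not None:
                close()
    # Second pass: only now break the links.
    for obj in trash:
        obj.__dict__.clear()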

Since we are discussing Py3k here, I believe it is the right time to revive
this discussion. The __close__ proposal I'm backing (summed up in this mail:
http://mail.python.org/pipermail/python-3000/2006-September/003892.html) is
pretty similar to how Guido was proposing to modify __del__. If there are
technical grounds for this (and my opinion does not matter much, but Guido
was proposing the same thing, which kind of gives me hope in this regard),
I believe it would be a far superior solution for the problem of implicit
finalization in the presence of CT in Py3k.

I think the idea is that, if you make sure that a __close__ method is called
exactly once (and before __dict__ is reclaimed), it really does not matter
much in which order you call __close__ methods within the CT. I mean, it
*might* matter for already-written in-the-wild __del__ methods of course,
but it sounds a *very* reasonable constraint for Py3k's __close__ methods. I
would like to see real-world examples where calling __close__ in random
order breaks things.

In the message linked above, you reply with:

[Tim]
> I would have no objection to "__del__ called only once" if it weren't
> for that Python currently does something different.  I don't know
> whether people rely on that now; if they do, it's a much more
> dangerous thing to change than adding a new keyword.

Would you still hold the same position? Do you consider this "only once"
rule as a possible way to solve implicit finalization in GC?
-- 
Giovanni Bajo


From qrczak at knm.org.pl  Tue Sep 26 14:24:27 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 26 Sep 2006 14:24:27 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com> (Tim
	Peters's message of "Tue, 26 Sep 2006 07:01:23 -0400")
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
Message-ID: <87lko63jes.fsf@qrnik.zagroda>

"Tim Peters" <tim.peters at gmail.com> writes:

> Read Modules/gc_weakref.txt for the gory details.

"It's a feature of Python's weakrefs too that when a weakref goes
away, the callback (if any) associated with it is thrown away too,
unexecuted."

I disagree with this choice. Doesn't it prevent weakrefs from being used as
finalizers?

Here is my semantics:

The core weakref constructor has three arguments: a key, a value,
and a finalizer.

(The finalizer is conceptually a function with no parameters.
In Python it's more convenient to make it a function with any arity,
along with the associated arguments.)

It's often the case that the key and the value are the same object.
The simplified form of the weakref constructor makes this assumption
and takes only a single object and a finalizer. The generic form is
needed for dictionaries with weak keys.

Creating a weak reference establishes a relationship:
- The key keeps the value alive.
- The weak reference and the finalizer are alive.

When the key dies, the relationship ends, and the finalizer is added
to a queue of finalizers to be executed.

Given a weak reference, you can obtain the value, which may instead report
that the weakref is dead (None). You can also invoke the finalizer
explicitly, which also ends the relationship (the calling thread is
suspended if the finalizer is currently executing). And you can kill the
weak reference, ending the relationship.

I believe this is a sufficient design for most practical purposes.
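
Most of these semantics can be sketched on top of today's weakref module.
Here is an illustrative, untested approximation of the simplified
(key == value) form, with all names invented for the example; the generic
key/value form needs ephemeron-style GC support that CPython's weakrefs
cannot express directly:

import weakref

_live = set()          # keeps each relationship (and its finalizer) alive
_finalizer_queue = []  # finalizers of dead keys, waiting to be executed

class WeakFinal(object):
    """Sketch of the simplified weakref-with-finalizer described above."""

    def __init__(self, obj, finalizer, *args):
        self._pending = (finalizer, args)
        # The weakref holds its callback (a bound method) strongly; this
        # forms a small self-cycle that the cyclic GC reclaims once the
        # relationship ends.
        self._ref = weakref.ref(obj, self._key_died)
        _live.add(self)

    def _key_died(self, _wr):
        # Key died: queue the finalizer instead of throwing it away.
        _live.discard(self)
        if self._pending is not None:
            _finalizer_queue.append(self._pending)
            self._pending = None

    def value(self):
        """Return the object, or None if the weakref is dead."""
        return self._ref()

    def finalize(self):
        """Invoke the finalizer explicitly; this too ends the relationship."""
        _live.discard(self)
        if self._pending is not None:
            fn, args = self._pending
            self._pending = None
            fn(*args)

    def kill(self):
        """Kill the weak reference: end the relationship, run no finalizer."""
        _live.discard(self)
        self._pending = None

In this sketch run-time support would drain _finalizer_queue at safe points;
here that is left to the application.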

See also
http://www.haible.de/bruno/papers/cs/weak/WeakDatastructures-writeup.html
but I disagree with the section about finalizers.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From qrczak at knm.org.pl  Tue Sep 26 15:15:17 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 26 Sep 2006 15:15:17 +0200
Subject: [Python-3000] Removing __del__
In-Reply-To: <01d301c6e166$070bde90$e303030a@trilan> (Giovanni Bajo's
	message of "Tue, 26 Sep 2006 14:19:34 +0200")
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
	<01d301c6e166$070bde90$e303030a@trilan>
Message-ID: <87irjahiqi.fsf@qrnik.zagroda>

"Giovanni Bajo" <rasky at develer.com> writes:

> Guido was proposing a solution with __del__ and CT, which
> approximately worked this way:
>
> - When a CT is detected, any __del__ method is invoked once per
> instance, in random order.

This means that __del__ may attempt to use an object which has already
had its __del__ called.

> Since we are discussing Py3k here, I believe it is the right time to revive
> this discussion. The __close__ proposal I'm backing (sumed up in this mail:
> http://mail.python.org/pipermail/python-3000/2006-September/003892.html) is
> pretty similar to how Guido was proposing to modify __del__.

"1) call __close__ on the instances *BEFORE* dropping the references.
The code in __close__ could break the cycle itself."

Same problem as above.

Note that the problem is solvable when the subset of links in these
objects which is needed during finalization doesn't contain cycles.
But the language implementation can't know *which* links those are.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From jimjjewett at gmail.com  Tue Sep 26 15:22:21 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 26 Sep 2006 09:22:21 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <4518F979.50902@canterbury.ac.nz>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<4518F979.50902@canterbury.ac.nz>
Message-ID: <fb6fbf560609260622r43bcb1c0uc887b72e901a0701@mail.gmail.com>

On 9/26/06, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Giovanni Bajo wrote:
> > I assert that they hide bugs much better than
> > __del__ does (it's pretty easy to find an offending __del__ by looking at
> > gc.garbage,

> It should be feasible to modify the cyclic GC to
> detect groups of objects that are only being kept
> alive by references from the finalizer list.

This would let you use a bound method again, but ...

Given this complexity, what advantage would it have over __del__, let
alone __close__?

-jJ

From jimjjewett at gmail.com  Tue Sep 26 15:30:01 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 26 Sep 2006 09:30:01 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
Message-ID: <fb6fbf560609260630k48130c98h28de5190653fba31@mail.gmail.com>

On 9/26/06, Tim Peters <tim.peters at gmail.com> wrote:
> [Giovanni Bajo]
> > You say that __del__ is hazardous, but I can't see how
> > weakrefs are less hazardous. As an implicit finalization method, they live on
> > the fragile assumption that the callback won't hold a reference to the object:

> Nope, not so.

I think you read "live" as "not trash", but in this particular
sentence, he meant it as "be useful".

> Read Modules/gc_weakref.txt for the gory details.  In
> outline, there are three objects of interest here:  the weakly
> referenced object (WO), the weakref (WR) to the WO, and the callback
> (CB) callable attached to the WR.

> /Normally/ the CB is reachable (== not trash).

(Otherwise it can't act as a finalizer, because it isn't around)

> If a reachable CB has
> a strong reference to the WO, then that keeps the WO reachable too,

So it doesn't act as a finalizer; it acts as an immortalizer.  All the
pain of __del__, and it takes only one to make a loop.  (Bound methods
are in this category.)

> OTOH, if the CB has a weak reference to the WO, then when the WO goes
> away and the CB is invoked, the CB's weak reference returns None
> instead of the WO.

So it still can't act as a proper finalizer, if only because it isn't
fast enough.

-jJ

From rasky at develer.com  Tue Sep 26 15:32:10 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 26 Sep 2006 15:32:10 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
	<fb6fbf560609260630k48130c98h28de5190653fba31@mail.gmail.com>
Message-ID: <033901c6e170$2b8103e0$e303030a@trilan>

Jim Jewett wrote:

>>> You say that __del__ is hazardous, but I can't see how
>>> weakrefs are less hazardous. As an implicit finalization method,
>>> they live on the fragile assumption that the callback won't hold a
>>> reference to the object: 
> 
>> Nope, not so.
> 
> I think you read "live" as "not trash", but in this particular
> sentence, he meant it as "be useful".

Yes. Sorry for my bad English...
-- 
Giovanni Bajo

From ncoghlan at gmail.com  Tue Sep 26 16:12:10 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 27 Sep 2006 00:12:10 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <120f01c6e14a$15edbfd0$4bbd2997@bagio>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>	<03bb01c6def4$257b6c70$4bbd2997@bagio>	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
Message-ID: <4519353A.2030103@gmail.com>

Giovanni Bajo wrote:
> Raymond Hettinger wrote:
> 
>> In short, __del__ should disappear not because it is useless but
>> because
>> it is hazardous.  The consenting adults philosophy means that we don't
>> put up artificial barriers to intentional hacks, but it does not mean
>> that we bait the hook and leave error-prone traps for the unwary.  In
>> Py3k, I would like to see explicit finalization as a preferred
>> approach
>> and for weakrefs be the one-way-to-do-it for designs with implicit
>> finalization.
> 
> Raymond, there is one thing I don't understand in your line of reasoning. You
> say that you prefer explicit finalization, but that implicit finalization still
> needs to be supported. And for that, you'd rather drop __del__ and use
> weakrefs. But why? You say that __del__ is hazardous, but I can't see how
> weakrefs are less hazardous.

As I see it, __del__ is more hazardous because it's an attractive nuisance - 
it *looks* like it should be easy to use, but I'm willing to bet that a lot of 
the __del__ methods implemented in the wild are either actual or potential 
bugs. For example, it would be easy for a maintenance programmer to make a 
change to include a reference in a data structure from a child node back to 
its parent node to address a problem, and suddenly the application's memory 
usage goes through the roof due to uncollectable cycles. Even the initial 
implementation of the generator __del__ slot in the *Python 2.5 core* was 
buggy, leading to such cycles - if the developers of the Python interpreter 
find it hard to get __del__ right, then there's something seriously wrong with 
it in its current form.

Explicitly stating that __del__ will go away in Py3k, with the current
intent being to replace it with explicit finalization (via with statements)
and the implicit finalization offered by weakref callbacks, encourages
people to look for ways to make the API for the latter easier to use.

For example, a "finalizer" factory function could be added to weakref:

import weakref

_finalizer_refs = set()
def finalizer(*args, **kwds):
     """Create a finalizer from an object, callback and keyword dictionary"""
     # Use positional args and a closure to avoid namespace collisions
     obj, callback = args
     def _finalizer(_ref=None):
         """Callable that invokes the finalization callback"""
         # Use closure to get at weakref to allow direct invocation
         # This creates a cycle, so this approach relies on cyclic GC
         # to clean up the finalizer objects!
         try:
             _finalizer_refs.remove(ref)
         except KeyError:
             pass
         else:
             callback(_finalizer)
     # Give callback access to keyword arguments
     _finalizer.__dict__ = kwds
     ref = weakref.ref(obj, _finalizer)
     _finalizer_refs.add(ref)
     return _finalizer

Example usage:

from weakref import finalizer

class Wrapper(object):
     def __init__(self, x=1):
         self._data = finalizer(self, self.finalize, x=x)
     @staticmethod
     def finalize(data):
         print "Finalizing: value=%s!" % data.x
     def get_value(self):
         return self._data.x
     def increment(self, by=1):
         self._data.x += by
     def close(self):
         self._data()  # Explicitly invoke the finalizer
         self._data = None

 >>> test = Wrapper()
 >>> test.get_value()
1
 >>> test.increment(2)
 >>> test.get_value()
3
 >>> del test
Finalizing: value=3!
 >>> test = Wrapper()
 >>> test.get_value()
1
 >>> test.increment(2)
 >>> test.get_value()
3
 >>> test.close()
Finalizing: value=3!
 >>> del test

For comparison, here's the __del__ based version (which has the downside of 
potentially giving the cyclic GC fits if other attributes are added to the 
object):

class Wrapper(object):
     def __init__(self, x=1):
         self._x = x
     def __del__(self):
         if self._x is not None:
             print "Finalizing: value=%s!" % self._x
     def get_value(self):
         return self._x
     def increment(self, by=1):
         self._x += by
     def close(self):
         self.__del__()
         self._x = None

Not counting the import line, both versions are 13 lines long (granted, the 
weakref version would be a bit longer if the finalizer needed access to public 
attributes - in that case, the weakref version would need to use properties to 
hide the existence of the finalizer object).

Cheers,
Nick.

P.S. the central finalizers list also works a treat for debugging why objects 
aren't getting finalized as expected - a simple loop like "for wr in 
weakref.finalizers: print gc.get_referrers(wr)" after a gc.collect() call 
works pretty well.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From jimjjewett at gmail.com  Tue Sep 26 16:12:15 2006
From: jimjjewett at gmail.com (Jim Jewett)
Date: Tue, 26 Sep 2006 10:12:15 -0400
Subject: [Python-3000] Removing __del__
In-Reply-To: <87irjahiqi.fsf@qrnik.zagroda>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
	<01d301c6e166$070bde90$e303030a@trilan> <87irjahiqi.fsf@qrnik.zagroda>
Message-ID: <fb6fbf560609260712o244e8589sfac901ab892ecd6@mail.gmail.com>

On 9/26/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> "Giovanni Bajo" <rasky at develer.com> writes:

> > Guido was proposing a solution with __del__ and CT, which
> > approximately worked this way:

> > - When a CT is detected, any __del__ method is invoked once per
> > instance, in random order.

[Note that this "__del__" is closer to what we've been calling
__close__ than to the existing __del__.]

Note that the "at most" part of "once" is already a stronger promise
than __close__.  That's OK (maybe even helpful) for users, it just
makes the implementation harder.

> This means that __del__ [~= __close__] may attempt to use an object which
> has already had its __del__ called.

Yes; this is the most important change between today's __del__
and the proposed __close__.

Today's __del__ doesn't have to defend against messed up subobjects,
because it immortalizes them.  A __close__ method would need to defend
against this, because of the arbitrary ordering.

In practice, close methods already defend against this anyhow, largely
because they know that they might be called by __del__ even after
being called explicitly.

-jJ

From rasky at develer.com  Tue Sep 26 16:41:52 2006
From: rasky at develer.com (Giovanni Bajo)
Date: Tue, 26 Sep 2006 16:41:52 +0200
Subject: [Python-3000] Removing __del__
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>	<03bb01c6def4$257b6c70$4bbd2997@bagio>	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com>
Message-ID: <047f01c6e179$e7b74d40$e303030a@trilan>

Nick Coghlan wrote:

>> Raymond, there is one thing I don't understand in your line of
>> reasoning. You say that you prefer explicit finalization, but that
>> implicit finalization still needs to be supported. And for that,
>> you'd rather drop __del__ and use weakrefs. But why? You say that
>> __del__ is hazardous, but I can't see how weakrefs are less
>> hazardous.
>
> As I see it, __del__ is more hazardous because it's an attractive
> nuisance - it *looks* like it should be easy to use, but I'm willing
> to bet that a lot of the __del__ methods implemented in the wild are
> either actual or potential bugs. For example, it would be easy for a
> maintenance programmer to make a change to include a reference in a
> data structure from a child node back to its parent node to address a
> problem, and suddenly the application's memory usage goes through the
> roof due to uncollectable cycles.

Is it easier or harder to detect such a cycle, compared to accidentally
adding a reference to self (through implicit nested scopes, or bound
methods) in the finalizer callback? You have to admit that, at best, they
are equally hazardous.

As things stand *now* (in Python 2.5, I mean), __del__ is easier to
understand/teach, easier to debug (gc.garbage vs finalizers silently
ignored), and easier to use (no boilerplate in user code, no additional
finalization API, which does not even exist yet). I saw numerous proposals
to address these weakref "defects" by adding some kind of finalizer API, by
modifying the GC to put uncollectable loops with weakref finalizers in
gc.garbage, and so on. Most finalization APIs (including yours) create
cycles just by being used, which also means that you *must* wait for the GC
to kick in before the object is finalized, making them useless for several
situations where you want implicit finalization to happen immediately
(file.close(), just to name one). [And we are speaking of implicit
finalization now; I know about 'with'.]

It would require some effort to make weakref finalizers even *barely* as
usable as __del__, and it would absolutely not solve the problem per se: the
user will still have to pay attention and understand the hoops (a different
kind of hoops, but still hoops). So, why do we not spend this same time
trying to *fix* __del__ instead? If somebody comes up with a sane way to
define the semantics of a new finalizer method (like the __close__
proposal), which can be invoked *even* in the case of cycles, would you
still prefer to go the weakref way?


> Even the initial implementation of
> the generator __del__ slot in the *Python 2.5 core* was buggy,
> leading to such cycles - if the developers of the Python interpreter
> find it hard to get __del__ right, then there's something seriously
> wrong with it in its current form.

I don't think it's a fair comparison: the generator is a pretty complex
class, compared to the average class developed in Python which might need a
__del__ method. I would also bet that you would get your first attempt at
finalizing generators through weakrefs wrong.


> By explicitly stating that __del__ will go away in Py3k, with the
> current intent being to replace it with explicit finalization (via
> with statements) and the implicit finalization offered by weakref
> callbacks, it encourages people to look for ways to make the API for
> the latter easier to use.
>
> For example, a "finalizer" factory function could be added to weakref:
>
> _finalizer_refs = set()
> def finalizer(*args, **kwds):
>      """Create a finalizer from an object, callback and keyword
>      dictionary""" # Use positional args and a closure to avoid
>      namespace collisions obj, callback = args
>      def _finalizer(_ref=None):
>          """Callable that invokes the finalization callback"""
>          # Use closure to get at weakref to allow direct invocation
>          # This creates a cycle, so this approach relies on cyclic GC
>          # to clean up the finalizer objects!
>          try:
>              _finalizer_refs.remove(ref)
>          except KeyError:
>              pass
>          else:
>              callback(_finalizer)
>      # Give callback access to keyword arguments
>      _finalizer.__dict__ = kwds
>      ref = weakref.ref(obj, _finalizer)
>      _finalizer_refs.add(ref)
>      return _finalizer

So uhm, am I reading it wrong, or does your implementation (like any other
similar API I have seen till now) create a cycle *just* by being used? This
finalizer API obfuscates user code by forcing the use of a separate _data
object to hold (part of) the context for apparently no good reason, and
makes the object collectable *only* through the cyclic GC (while __del__
would happily be invoked in simple cases when the object goes out of scope).

> P.S. the central finalizers list also works a treat for debugging why
> objects aren't getting finalized as expected - a simple loop like
> "for wr in weakref.finalizers: print gc.get_referrers(wr)" after a
> gc.collect() call works pretty well.

Yes, this is indeed interesting. One step closer to getting the __del__
feature set :)
-- 
Giovanni Bajo


From rrr at ronadam.com  Tue Sep 26 16:45:03 2006
From: rrr at ronadam.com (Ron Adam)
Date: Tue, 26 Sep 2006 09:45:03 -0500
Subject: [Python-3000] Removing __del__
In-Reply-To: <01d301c6e166$070bde90$e303030a@trilan>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>	<03bb01c6def4$257b6c70$4bbd2997@bagio>	<4518693F.1050500@ewtllc.com>	<120f01c6e14a$15edbfd0$4bbd2997@bagio>	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
	<01d301c6e166$070bde90$e303030a@trilan>
Message-ID: <efbejo$j7$1@sea.gmane.org>

Giovanni Bajo wrote:

> Since we are discussing Py3k here, I believe it is the right time to revive
> this discussion. The __close__ proposal I'm backing (summed up in this mail:
> http://mail.python.org/pipermail/python-3000/2006-September/003892.html) is
> pretty similar to how Guido was proposing to modify __del__. If there are
> technical grounds for this (and my opinion does not matter much, but Guido
> was proposing the same thing, which kind of gives me hope in this regard),
> I believe it would be a far superior solution for the problem of implicit
> finalization in the presence of CT in Py3k.
> 
> I think the idea is that, if you make sure that a __close__ method is called
> exactly once (and before __dict__ is reclaimed), it really does not matter
> much in which order you call __close__ methods within the CT. I mean, it
> *might* matter for already-written in-the-wild __del__ methods of course,
> but it sounds like a *very* reasonable constraint for Py3k's __close__ methods. I
> would like to see real-world examples where calling __close__ in random
> order breaks things.

How about...?  (This isn't an area I'm really familiar with.)


Replace __del__ with:

    a __final__ method and a __finalized__ flag.  (or other equivalent names)

    Insist on explicit finalizing by raising an exception if an object's
    __finalized__ flag is still False when it loses its last reference.


Would this be difficult to do in a timely way so the traceback is meaningful?

Would this avoid the problems being discussed with both __del__ and weak refs?
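
One rough way to approximate the idea with today's machinery might look like
the sketch below (names invented; it reuses __del__ itself as the detection
hook, which is exactly what this thread wants to avoid, so it only
illustrates the behaviour):

import sys

class MustFinalize(object):
    """Objects that must be explicitly finalized before dying."""
    __finalized__ = False

    def __final__(self):
        # ... release resources here, then record that we did ...
        self.__finalized__ = True

    def __del__(self):
        if not self.__finalized__:
            # An exception raised in a finalizer is only printed anyway,
            # so report the omission directly.
            print >> sys.stderr, "%r lost its last reference unfinalized" % self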


    Ron



From ncoghlan at gmail.com  Tue Sep 26 17:41:58 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 27 Sep 2006 01:41:58 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <047f01c6e179$e7b74d40$e303030a@trilan>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>	<03bb01c6def4$257b6c70$4bbd2997@bagio>	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com>
	<047f01c6e179$e7b74d40$e303030a@trilan>
Message-ID: <45194A46.90406@gmail.com>

Giovanni Bajo wrote:
> It would require some effort to make weakref finalizers *barely* as usable
> as __del__, and will absolutely not solve the problem per-se: the user will
> still have to pay attention and understand the hoops (different kind of
> hoops, but still hoops). So, why do we not spend this same time trying to
> *fix* __del__ instead? If somebody comes up with a sane way to define the
> semantic for a new finalizer method (like the __close__ proposal), which can
> be invoked *even* in the case of cycles, would you still prefer to go the
> weakref way?

Yes. I believe any replacement for __del__ should be syntactic sugar for some
form of weak reference callback. At the moment, we have two finalization
methods (__del__ and weakref callbacks). Py3k gives us the opportunity to get
rid of one of them. Since weakref callbacks are strictly more powerful,
__del__ should be the one to go.

Having first made the decision to reduce the number of finalization mechanisms 
to exactly one, I then have no problem with the idea of developing an easy to 
use weakref-based approach to replace the current __del__ (which may or may 
not be a magic method).

> So uhm, am I reading it bad or your implementation (like any other similar
> API I have seen till now) create a cycle *just* by using it?

To use Tim's terminology, the weakref (WR) and the callback (CB) are in a 
cycle with each other, so even after CB is invoked and removes WR from the 
global list of finalizers, the two objects won't go away until the next GC 
collection cycle. The weakly referenced object (WO) itself isn't part of the 
cycle and gets finalized at the first opportunity after its reference count 
goes to zero (as shown in my example - the finalizer ran without having to 
call gc.collect() first).

And don't forget that in non-refcounting implementations like Jython, 
IronPython and some flavours of PyPy, even non-cyclic garbage is collected 
through the GC mechanism at an arbitrary time after the last reference is 
released. If you need prompt finalization (for activities such as closing file 
handles or database connections), that's the whole reason the 'with' statement 
was added in Python 2.5.

All that aside, my example finalizer API only took an hour or two to write, 
compared to the significant amount of effort that has gone into the current 
__del__ implementation. There are actually a number of ways to write weakref 
based finalization that avoid that WR-CB cycle I used, but considering the 
trade-offs between those approaches is a lot more than a two-hour project 
(and, not the least bit incidentally, not an assessment I would really want to 
make on my own ;).

> This finalizer
> API ofhuscates user code by forcing to use a separate _data object to hold
> (part of) the context for apparently no good reason, and make the object
> collectable *only* through the cyclic GC (while __del__ would happily be
> invoked in simple cases when the object goes out of context).

It stores part of the context in a separate object for an *excellent* reason - 
it identifies clearly to the Python interpreter *which* parts of the object 
the finalizer can access. The biggest problem with __del__ is that it 
*doesn't* make that distinction, so the interpreter is forced to assume the 
finalizer might touch any part of the object (including the object itself), 
leading to all of the insanity with self-resurrection and the need to relegate 
things to gc.garbage. With a weakref-based approach, you only end up with two 
possible scenarios:

1. Object gets trashed and finalized
2. Object is kept immortal by a strong reference from the callback in the list 
of finalizers

By not teaching people who care about finalization the important
distinction between "things the finalizer can get at" and "things the object
can get at but the finalizer can't", you aren't doing them any favours,
because maintaining that distinction is the easiest way to avoid creating
uncollectable cycles (i.e. by making sure the finalizer can't get at the
other objects that might reference back to the current one).

Cheers,
Nick.


-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From rrr at ronadam.com  Tue Sep 26 17:32:27 2006
From: rrr at ronadam.com (Ron Adam)
Date: Tue, 26 Sep 2006 10:32:27 -0500
Subject: [Python-3000] Removing __del__
In-Reply-To: <efbejo$j7$1@sea.gmane.org>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>	<008901c6de94$d2072ed0$4bbd2997@bagio>	<20060922235602.GA3427@panix.com>	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>	<039d01c6def1$46df1ef0$4bbd2997@bagio>	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>	<03bb01c6def4$257b6c70$4bbd2997@bagio>	<4518693F.1050500@ewtllc.com>	<120f01c6e14a$15edbfd0$4bbd2997@bagio>	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>	<01d301c6e166$070bde90$e303030a@trilan>
	<efbejo$j7$1@sea.gmane.org>
Message-ID: <efbhcj$cb6$1@sea.gmane.org>

This was a bit too brief, I think...

Ron Adam wrote:

> How about...?  (This isn't an area I'm real familiar with.)
> 
> 
> Replace __del__ with:
> 
>     a __final__ method and a __finalized__ flag.  (or other equivalent names)

The __final__ method would need to be explicitly called, and the __finalized__ 
flag could be set either by the interpreter or the __final__ method when 
__final__ is called.  __final__ would never be called implicitly by the interpreter.

>     Insist on explicit finalizing by raising an exception if an object's
>     __finalized__ flag is still False when it loses its last reference.
> 
> 
> Would this be difficult to do in a timely way so the traceback is meaningful?
> 
> Would this avoid the problems being discussed with both __del__ and weak refs?
> 
> 
>     Ron

Maybe just adding an optional __finalized__ flag, which when False forces an
exception if an object loses its last reference, might be enough.

I think....

It's not the actual closing/finishing/etc... that is the problem; it's
detecting when the closing/finishing/etc... has not been done that is the
problem.

Cheers,
    Ron

From martin at v.loewis.de  Tue Sep 26 21:14:29 2006
From: martin at v.loewis.de (=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=)
Date: Tue, 26 Sep 2006 21:14:29 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
References: <45152234.1090303@v.loewis.de>
	<4517194D.1030908@nekomancer.net>	<20060924210217.0873.JCARLSON@uci.edu>	<fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>
	<1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
Message-ID: <45197C15.9040005@v.loewis.de>

Paul Prescod schrieb:
>  There is at least one big difference between surrogate pairs and
> decomposed characters. The user can typically normalize away
> decompositions. How do you normalize away decompositions in a language
> that only supports 16-bit representations?

I don't see the problem: You use UTF-16; all normal forms (NFC, NFD,
NFKC, NFKD) can be represented in UTF-16 just fine.

It is somewhat tricky to implement a normalization algorithm in
UTF-16, since you must combine surrogate pairs first in order to
find out what the canonical decomposition of the code point is;
but it's just more code, and no problem in principle.
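
(For reference, the combining step itself is simple arithmetic; an
illustrative helper:)

def combine_surrogates(hi, lo):
    # hi in U+D800..U+DBFF, lo in U+DC00..U+DFFF, per the UTF-16 definition.
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)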

Regards,
Martin

From qrczak at knm.org.pl  Tue Sep 26 21:20:24 2006
From: qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk)
Date: Tue, 26 Sep 2006 21:20:24 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45197C15.9040005@v.loewis.de> (Martin v.
	=?iso-8859-2?q?L=F6wis's?= message of "Tue, 26 Sep 2006 21:14:29 +0200")
References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net>
	<20060924210217.0873.JCARLSON@uci.edu>
	<fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>
	<1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
	<45197C15.9040005@v.loewis.de>
Message-ID: <87ejtyjuyv.fsf@qrnik.zagroda>

"Martin v. L?wis" <martin at v.loewis.de> writes:

> It is somewhat tricky to implement a normalization algorithm in
> UTF-16, since you must combine surrogate pairs first in order to
> find out what the canonical decomposition of the code point is;
> but it's just more code, and no problem in principle.

The same issue arises with virtually any algorithm: more code, and more
complex code, is needed with UTF-16 than with UTF-32.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From martin at v.loewis.de  Tue Sep 26 21:25:08 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Tue, 26 Sep 2006 21:25:08 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <ef8ugh$t6o$1@sea.gmane.org>
References: <4511E644.2030306@blueyonder.co.uk>	<451523DC.2050901@v.loewis.de>	<20060923104310.0863.JCARLSON@uci.edu>	<45158830.8020908@v.loewis.de>	<ef5ugf$d9a$1@sea.gmane.org>	<4516B2D0.9020109@v.loewis.de>
	<ef8ugh$t6o$1@sea.gmane.org>
Message-ID: <45197E94.3010502@v.loewis.de>

Fredrik Lundh schrieb:
>> I believe it would noticeably simplify the implementation if there is
>> only a single internal representation.
> 
> and I, wearing my string algorithm implementor hat, tend to disagree 
> with that.  writing source code that can be compiled into efficient code 
> for multiple representations is mostly trivial, even in C.

I wouldn't call SRE's macro trickeries "trivial", though.

Regards,
Martin

From paul at prescod.net  Tue Sep 26 22:44:07 2006
From: paul at prescod.net (Paul Prescod)
Date: Tue, 26 Sep 2006 13:44:07 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45197C15.9040005@v.loewis.de>
References: <45152234.1090303@v.loewis.de> <4517194D.1030908@nekomancer.net>
	<20060924210217.0873.JCARLSON@uci.edu>
	<fb6fbf560609250733m68433872u28e27b47dadcbe47@mail.gmail.com>
	<1cb725390609250850w51903f00w148b750afdae9ee8@mail.gmail.com>
	<45197C15.9040005@v.loewis.de>
Message-ID: <1cb725390609261344m51297926tac13968f33eaee82@mail.gmail.com>

I misspoke. I meant to ask: "How do you normalize away surrogate pairs in
UTF-16?" It was a rhetorical question. The point was just that decomposed
characters can be handled by implicit or explicit normalization. Surrogate
pairs can only be similarly normalized away if your model allows you to
represent their normalized forms. A UTF-16 character model would not.

On 9/26/06, "Martin v. L?wis" <martin at v.loewis.de> wrote:
>
> Paul Prescod schrieb:
> >  There is at least one big difference between surrogate pairs and
> > decomposed characters. The user can typically normalize away
> > decompositions. How do you normalize away decompositions in a language
> > that only supports 16-bit representations?
>
> I don't see the problem: You use UTF-16; all normal forms (NFC, NFD,
> NFKC, NFKD) can be represented in UTF-16 just fine.
>
> It is somewhat tricky to implement a normalization algorithm in
> UTF-16, since you must combine surrogate pairs first in order to
> find out what the canonical decomposition of the code point is;
> but it's just more code, and no problem in principle.
>
> Regards,
> Martin
>

From greg.ewing at canterbury.ac.nz  Wed Sep 27 02:36:14 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 27 Sep 2006 12:36:14 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <87lko63jes.fsf@qrnik.zagroda>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<1f7befae0609260401g4d6a309asc8e1a11c56f0d9ec@mail.gmail.com>
	<87lko63jes.fsf@qrnik.zagroda>
Message-ID: <4519C77E.6020503@canterbury.ac.nz>

Marcin 'Qrczak' Kowalczyk wrote:

> "It's a feature of Python's weakrefs too that when a weakref goes
> away, the callback (if any) associated with it is thrown away too,
> unexecuted."
> 
> I disagree with this choice. Doesn't it prevent weakrefs from being used
> as finalizers?

No, it's quite possible to build a finalization mechanism
on top of weakrefs.

To register a finalizer F for an object O, you create a
weak reference W to O and store it in a global list.
You give W a callback that invokes F and then removes
W from the global list.

Now there's no way that W can go away before its callback
is invoked, since that's the only thing that removes it
from the global list.
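
A minimal sketch of that scheme (names invented, untested):

import weakref

_live_finalizers = []   # the global list: keeps each W (and its callback) alive

def register_finalizer(obj, func, *args):
    """Register finalizer func(*args) to run after obj is collected."""
    def callback(wr):
        # This is the only thing that removes W from the global list,
        # so W cannot go away before the callback has run.
        _live_finalizers.remove(wr)
        func(*args)          # invoke F once O is gone
    w = weakref.ref(obj, callback)
    _live_finalizers.append(w)
    return w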

Furthermore, if the user makes a mistake and registers
a function F that references its own object O, directly
or indirectly, then eventually we will be left with a
cycle that's only being kept alive from the global list
via W and its callback. The cyclic GC can detect this
situation and move the cycle to a garbage list or
otherwise alert the user.

I don't believe that this mechanism would be any
harder to use *correctly* than __del__ methods
currently are, and mistakes made in using it would
be no harder to debug.

--
Greg

From greg.ewing at canterbury.ac.nz  Wed Sep 27 02:36:21 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 27 Sep 2006 12:36:21 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <fb6fbf560609260622r43bcb1c0uc887b72e901a0701@mail.gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio>
	<4518F979.50902@canterbury.ac.nz>
	<fb6fbf560609260622r43bcb1c0uc887b72e901a0701@mail.gmail.com>
Message-ID: <4519C785.4010707@canterbury.ac.nz>

Jim Jewett wrote:

> Given this complexity, what advantage would it have over __del__, let
> alone __close__?

It wouldn't constitute an attractive nuisance, since
it would force you to think about which pieces of
information the finalizer really needs. This is
something you need to do anyway if you're to ensure
you don't get into trouble using __del__.

The supposed "easiness" of __del__ is really just
sloppiness that will turn around and bite you
eventually (if you'll excuse the mixed metaphor).

--
Greg

From greg.ewing at canterbury.ac.nz  Wed Sep 27 02:36:35 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 27 Sep 2006 12:36:35 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <047f01c6e179$e7b74d40$e303030a@trilan>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com>
	<047f01c6e179$e7b74d40$e303030a@trilan>
Message-ID: <4519C793.2050800@canterbury.ac.nz>

Giovanni Bajo wrote:

> Is it easier or harder to detect such a cycle, compared to accidentally
> adding a reference to self (through implicit nested scopes, or bound
> methods) in the finalizer callback?

I would put a notice in the docs strongly recommending that
only global functions be registered as finalizers, not
nested functions or bound methods.

While not strictly necessary (or sufficient) for safety,
following this guideline would greatly reduce the chance
of accidentally creating troublesome cycles, I think.

And if you did accidentally create such a cycle, it seems
to me it would be much easier to fix than if you were
using __del__, since you only need to make an adjustment
to the parameter list of the finalizer.

With __del__, you need to refactor your whole finalization
strategy and create another object to do the finalization,
which is a much bigger upheaval.

> Most finalization APIs (including yours) create
> cycles just by being used, which also means that you *must* wait for the GC
> to kick in before the object is finalized

No, a weakref-based finalizer will kick in just as soon
as __del__ would. I don't know what makes you think
otherwise.

> will absolutely not solve the problem per se: the user will
> still have to pay attention and understand the hoops

Certainly, but it will make it much more obvious
that the hoops are there in the first place, and
exactly where and what shape they are.

> So, why do we not spend this same time trying to
> *fix* __del__ instead?

So far nobody has found a *way* to fix __del__
(really fix it, that is, not just paper over the
cracks). And a lot of smart people have given it
a lot of thought over the years.

If someone comes up with a way some day, we can
always put __del__ back in. But I don't feel like
holding my breath waiting for that to happen, when
we have something else that we know will work.

>>         # Use closure to get at weakref to allow direct invocation
>>         # This creates a cycle, so this approach relies on cyclic GC
>>         # to clean up the finalizer objects!

This implementation is broken. There's no need
to create any such cycle.

--
Greg

From greg.ewing at canterbury.ac.nz  Wed Sep 27 02:36:41 2006
From: greg.ewing at canterbury.ac.nz (Greg Ewing)
Date: Wed, 27 Sep 2006 12:36:41 +1200
Subject: [Python-3000] Removing __del__
In-Reply-To: <45194A46.90406@gmail.com>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio> <4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com>
	<047f01c6e179$e7b74d40$e303030a@trilan> <45194A46.90406@gmail.com>
Message-ID: <4519C799.8000603@canterbury.ac.nz>

Nick Coghlan wrote:
> the weakref (WR) and the callback (CB) are in a 
> cycle with each other, so even after CB is invoked and removes WR from the 
> global list of finalizers, the two objects won't go away until the next GC 
> collection cycle.

The CB can drop its reference to the WR when it's invoked.

--
Greg


From ncoghlan at gmail.com  Wed Sep 27 03:36:15 2006
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Wed, 27 Sep 2006 11:36:15 +1000
Subject: [Python-3000] Removing __del__
In-Reply-To: <4519C793.2050800@canterbury.ac.nz>
References: <324634B71B159D469BCEB616678A6B94F94C3B@ingdexs1.ingdirect.com>
	<B6FAC926EFE7B348B12F29CF7E4A93D401CF46A1@hammer.office.bhtrader.com>
	<008901c6de94$d2072ed0$4bbd2997@bagio>
	<20060922235602.GA3427@panix.com>
	<6a36e7290609221735hcbd3df2ne41406323ce5fd72@mail.gmail.com>
	<039d01c6def1$46df1ef0$4bbd2997@bagio>
	<6a36e7290609230222w1fe8dfaam4780a1fd81481cd0@mail.gmail.com>
	<03bb01c6def4$257b6c70$4bbd2997@bagio>
	<4518693F.1050500@ewtllc.com>
	<120f01c6e14a$15edbfd0$4bbd2997@bagio> <4519353A.2030103@gmail.com>
	<047f01c6e179$e7b74d40$e303030a@trilan>
	<4519C793.2050800@canterbury.ac.nz>
Message-ID: <4519D58F.5070103@gmail.com>

Greg Ewing wrote:
>>>         # Use closure to get at weakref to allow direct invocation
>>>         # This creates a cycle, so this approach relies on cyclic GC
>>>         # to clean up the finalizer objects!
> 
> This implementation is broken. There's no need
> to create any such cycle.

I know, but it was late and my brain wasn't up to the job of getting rid of it :)

Here's a pretty easy way to fix it to avoid relying on the cyclic GC (actually 
based on your other message about explicitly breaking the cycle when the 
finalizer is invoked):

import weakref

_finalizer_refs = set()
def finalizer(*args, **kwds):
      """Create a finalizer from an object, callback and keyword dictionary"""
      # Use positional args and a closure to avoid namespace collisions
      obj, callback = args
      def _finalizer(_ref=None):
          """Callable that invokes the finalization callback"""
          # Use closure to get at weakref to allow direct invocation
          try:
              ref = boxed_ref.pop()
          except IndexError:
              pass
          else:
              _finalizer_refs.remove(ref)
              callback(_finalizer)
      # Give callback access to keyword arguments
      _finalizer.__dict__ = kwds
      boxed_ref = [weakref.ref(obj, _finalizer)]
      _finalizer_refs.add(boxed_ref[0])
      return _finalizer

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org

From jcarlson at uci.edu  Thu Sep 28 01:32:33 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 27 Sep 2006 16:32:33 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <45176886.2090201@v.loewis.de>
References: <20060924144006.086D.JCARLSON@uci.edu>
	<45176886.2090201@v.loewis.de>
Message-ID: <20060927153914.089D.JCARLSON@uci.edu>


"Martin v. L?wis" <martin at v.loewis.de> wrote:
> 
> Josiah Carlson schrieb:
> > What about a tree structure over the top of the string as I described in
> > another post?  If there are no surrogate pairs, the pointer to the tree
> > is null.  If there are surrogate pairs, we could either use the
> > structure as I described, or even modify it so that we get even better
> > memory utilization/performance (choose tree nodes based on where
> > surrogate pairs are, up to some limit).
> 
> As always, it's a time-vs-space tradeoff. People tend to resolve these
> in favor of time, accepting an increase in space. I'm not so sure this
> is always the right answer. In the specific case, I'm also worried about
> the increase in complexness.
> 
> That said, it is always good to have a prototype implementation to
> analyse the consequences better.

I'm away from my main machine at the moment, so I am unable to test my
implementation, but I do have a sample.

There are two main functions in this implementation: one constructs a tree
for O(log n) worst-case access to character addresses, and one traverses the
tree to discover the character address.  For strings without surrogates,
character address discovery is O(1).  The implementation of surrogate
discovery is very simple, using sections 3.8 and 5.4 of the Unicode 4.0
standard.

If there are no surrogates, it takes a single pass over the input, and
constructs a single node (12 or 24 bytes, depending on the build, need
to replace long with Py_ssize_t).  If there are surrogates, it creates a
block of nodes, adjusts pointers to create a tree, and returns a pointer
to the root.  The tree will have at most O(n/logn) nodes, though it will
tend to create long blocks of non-surrogates, so that if you have a
single surrogate in the middle of a huge string, it will be conceptually
viewed as 3 blocks.
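
For illustration only, the lookup side of this can be sketched in pure
Python as a sorted table of block starts plus a binary search (the
attachment implements an actual tree in C; all names here are invented):

from bisect import bisect_right

def build_index(units):
    # units: a sequence of UTF-16 code units.  Record the (character,
    # code-unit) positions at which blocks start; a new block begins
    # right after every surrogate pair.  With no surrogates the table
    # keeps a single entry and lookups stay O(1).
    char_starts, unit_starts = [0], [0]
    char = unit = 0
    n = len(units)
    while unit < n:
        if (0xD800 <= units[unit] <= 0xDBFF and unit + 1 < n
                and 0xDC00 <= units[unit + 1] <= 0xDFFF):
            unit += 2   # a surrogate pair: one character, two code units
            char += 1
            char_starts.append(char)
            unit_starts.append(unit)
        else:
            unit += 1
            char += 1
    return char_starts, unit_starts

def char_to_unit(index, i):
    # O(log n) worst case; inside a block there are no pairs, so the
    # remaining offset maps one-to-one.
    char_starts, unit_starts = index
    k = bisect_right(char_starts, i) - 1
    return unit_starts[k] + (i - char_starts[k])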


Attached is my untested sample implementation (I'm away for the next
week or so, and can't test), that should give an idea of what I was
talking about.

 - Josiah
-------------- next part --------------
A non-text attachment was scrubbed...
Name: surrogate_tree.c
Type: application/octet-stream
Size: 4688 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20060927/8a293362/attachment.obj 

From martin at v.loewis.de  Thu Sep 28 05:21:52 2006
From: martin at v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Thu, 28 Sep 2006 05:21:52 +0200
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <20060927153914.089D.JCARLSON@uci.edu>
References: <20060924144006.086D.JCARLSON@uci.edu>
	<45176886.2090201@v.loewis.de>
	<20060927153914.089D.JCARLSON@uci.edu>
Message-ID: <451B3FD0.9030600@v.loewis.de>

Josiah Carlson schrieb:
> Attached is my untested sample implementation (I'm away for the next
> week or so, and can't test), that should give an idea of what I was
> talking about.

Thanks. It is hard to tell what the impact on the implementation is.
For example, ISTM that you have to regenerate the tree each time
a new string is created. E.g. if you slice a string, you would
have to regenerate the tree for the slice. Right?

As for the implementation: If you are using an array-based heap,
couldn't you just drop the left and right child pointers, and
instead use indices 2*k and 2*k+1 to find the child nodes?
This would cut memory overhead significantly; you'd only
need the length of the array to determine what a leaf node is.
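
(A sketch of that indexing, using the conventional 1-based layout:)

def children(k):
    # Implicit binary tree stored in an array (1-based): node k's
    # children live at fixed slots, so no pointers are stored.
    return 2 * k, 2 * k + 1

def is_leaf(k, n):
    # With n nodes, k is a leaf iff its first child index falls
    # past the end of the array.
    return 2 * k > n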

Regards,
Martin

From jcarlson at uci.edu  Thu Sep 28 05:49:38 2006
From: jcarlson at uci.edu (Josiah Carlson)
Date: Wed, 27 Sep 2006 20:49:38 -0700
Subject: [Python-3000] How will unicode get used?
In-Reply-To: <451B3FD0.9030600@v.loewis.de>
References: <20060927153914.089D.JCARLSON@uci.edu>
	<451B3FD0.9030600@v.loewis.de>
Message-ID: <20060927204323.08A4.JCARLSON@uci.edu>


"Martin v. L?wis" <martin at v.loewis.de> wrote:
> 
> Josiah Carlson schrieb:
> > Attached is my untested sample implementation (I'm away for the next
> > week or so, and can't test), that should give an idea of what I was
> > talking about.
> 
> Thanks. It is hard to tell what the impact on the implementation is.
> For example, ISTM that you have to regenerate the tree each time
> a new string is created. E.g. if you slice a string, you would
> have to regenerate the tree for the slice. Right?

Generally, yes.  We could use the pre-existing tree information, but
it would probably be simpler (and faster) to scan the string and
re-create it.

Really, one would create the tree when someone wants to access an index
for the first time (or during creation, for fewer surprises), then use
the index finding function to return the address of character i.

> As for the implementation: If you are using a array-based heap,
> couldn't you just drop the left and right child pointers, and
> instead use indices 2*k and 2*k+1 to find the child nodes?
> This would get down memory overhead significantly; you'd only
> need the length of the array to determine what a leaf node is.

Good point.  I had originally malloced each node individually, but I
overlooked the heap optimization when I went with that style of construction.

 - Josiah