Mailman 3 April 2009 - Python-Dev

Going off-line for a week
by Guido van Rossum 10 Apr '09

Folks, I'm going off-line for a week to enjoy a family vacation. When I come back I'll probably just archive most email unread, so now's your chance to add braces to the language. :-)

Not-yet-retiring-ly y'rs,
--Guido van Rossum (home page: http://www.python.org/~guido/)

Rethinking intern() and its data structure
by John Arbash Meinel 10 Apr '09

I've been doing some memory profiling of my application, and I've found some interesting results with how intern() works. I was pretty surprised to see that the "interned" dict was actually consuming a significant amount of total memory.

To give the specific values, after doing:

  bzr branch A B

of a small project, the total memory consumption is ~21MB. Of that, the largest single object is the 'interned' dict, at 1.57MB, which contains 22k strings. One interesting bit: the size of it + the referenced strings is only 2.4MB. So the "interned" dict *by itself* is 2/3rds the size of the dict + strings it contains. It also means that the average size of a referenced string is 37.4 bytes. A 'str' has 24 bytes of overhead, so the average string is 13.5 characters long. So to save references to 13.5*22k ~ 300kB of character data, we are paying 2.4MB, or about 8:1 overhead.

When I looked at the actual references from interned, I saw mostly variable names, which makes sense considering that every variable name goes through the Python intern dict. And when you look at the intern function, it doesn't use setdefault logic; it actually does a get() followed by a set(), which means the cost of interning is 1-2 lookups depending on likelihood, etc. (I saw a whole lot of strings as the error codes in win32all / winerror.py, and Windows error codes tend to be longer-than-average variable length.)

Anyway, I think the internals of intern() could be done a bit better. Here are some concrete things:

a) Don't keep a double reference to both key and value to the same object (1 pointer per entry). This could be as simple as using a Set() instead of a dict().

b) Don't cache the hash key in the set, as strings already cache them (1 long per entry). This is a big win for space, but would need to be balanced against lookup and collision resolving speed. My guess is that reducing the size of the set will actually improve speed more, because more items can fit in cache. It depends on how many times you need to resolve a collision. If the string hash is sufficiently spread out, and the load factor is reasonable, then likely when you actually find an item in the set, it will be the item you want, and you'll need to bring the string object into cache anyway, so that you can do a string comparison (rather than just a hash comparison).

c) Use the existing lookup function one time (PySet->lookup()). Sets already have a "lookup" which is optimized for strings, and returns a pointer to where the object would go if it exists. This means the intern() function can do a single lookup resolving any collisions, and return the object or insert without doing a second lookup.

d) Having a special structure might also allow for separate optimizing of things like 'default size', 'grow rate', 'load factor', etc. A lot of this could be tuned specifically knowing that we really only have 1 of these objects, and it is going to be pointing at a lot of strings that are < 50 bytes long. If hashes of variable name strings are well distributed, we could probably get away with a load factor of 2. If we know we are likely to have lots and lots that never go away (you rarely *unload* modules, and all variable names are in the intern dict), that would suggest having a large initial size, and probably a wide growth factor to avoid spending a lot of time resizing the set.

e) How tuned is String.hash() for the fact that most of these strings are going to be ascii text? (I know that Python wants to support non-ascii variable names, but I still think there is going to be an overwhelming bias towards characters in the range 65-122 ('A'-'z').)

Also note that the performance of the "interned" dict gets even worse on 64-bit platforms, where the size of a 'dictentry' doubles but the average length of a variable name wouldn't change.

Anyway, I would be happy to implement something along the lines of a "StringSet", or maybe the "InternSet", etc. I just wanted to check if people would be interested or not.

John
=:->

PS: I'm not yet subscribed to python-dev, so if you could make sure to CC me in replies, I would appreciate it.

decorator module in stdlib?
by Daniel Fetchinson 10 Apr '09

The decorator module [1] written by Michele Simionato is a very useful tool for maintaining function signatures while applying a decorator. Many different projects implement their own versions of the same functionality; for example, TurboGears has its own utility for this, and I guess others do something similar too.

Has the question of including this module in the stdlib been raised before? If yes, what were the arguments against it? If not, what do you folks think, shouldn't it be included? I certainly think it should be.

Originally I sent this message to c.l.p [2] and Michele suggested it be brought up on python-dev. He also pointed out that a PEP [3] is already written about this topic and it is in draft form.

What do you guys think, wouldn't this be a useful addition to functools?

Cheers,
Daniel

[1] http://pypi.python.org/pypi/decorator
[2] http://groups.google.com/group/comp.lang.python/browse_thread/thread/d40560…
[3] http://www.python.org/dev/peps/pep-0362/

--
Psss, psss, put it down! - http://www.cafepress.com/putitdown
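To make the signature problem concrete, here is a small sketch using only today's standard library. It does not use the decorator module itself; `naive_trace`, `wrapped_trace` and `add` are made-up names, and `inspect.signature` (the API that grew out of PEP 362) did not exist when this thread was written.

```python
import functools
import inspect


def naive_trace(func):
    """Decorator that hides the wrapped function's signature."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def wrapped_trace(func):
    """Same decorator, but functools.wraps records the original function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def add(x, y=2):
    return x + y


# inspect.signature follows the __wrapped__ attribute set by functools.wraps.
naive_sig = str(inspect.signature(naive_trace(add)))
kept_sig = str(inspect.signature(wrapped_trace(add)))

print(naive_sig)  # (*args, **kwargs)
print(kept_sig)   # (x, y=2)
```

The naive wrapper reports only its own catch-all parameters, which is the loss of information the message describes. The second wrapper still accepts any arguments at call time; only the reported signature is preserved.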

Re: [Python-Dev] Rethinking intern() and its data structure
by John Arbash Meinel 09 Apr '09

Alexander Belopolsky wrote:
> On Thu, Apr 9, 2009 at 11:02 AM, John Arbash Meinel
> <john(a)arbash-meinel.com> wrote:
> ...
>> a) Don't keep a double reference to both key and value to the same
>> object (1 pointer per entry), this could be as simple as using a
>> Set() instead of a dict()
>>
>
> There is a rejected patch implementing just that:
> http://bugs.python.org/issue1507011 .
>

Thanks for the heads up.

So reading that thread, the final reason it was rejected had two parts:

  Without reviewing the patch again, I also doubt it is capable of getting rid of the reference count cheating: essentially, this cheating enables the interning dictionary to have weak references to strings, this is important to allow automatic collection of certain interned strings. This feature needs to be preserved, so the cheating in the reference count must continue.

That specific argument was invalid, because the patch just changed the refcount trickery to use +- 1. And I'm pretty sure Alexander's argument was just that +- 2 was weird, not that the "weakref" behavior was bad.

The other argument against the patch was based on the idea that:

  The operation "give me the member equal but not identical to E" is conceptually a lookup operation; the mathematical set construct has no such operation, and the Python set models it closely. IOW, set is *not* a dict with key==value.

I don't know if there was any consensus reached on this, since only Martin responded this way.

I can say that for my "do some work with a medium size code base", the overhead of "interned" as a dictionary was 1.5MB out of 20MB total memory. Simply changing it to a Set would drop this to 1.0MB. I have no proof about the impact on performance, since I haven't benchmarked it yet. Changing it to a StringSet could further drop it to 0.5MB. I would guess that any performance impact would depend on whether the total size of 'interned' would fit inside L2 cache or not.

There is a small bug in the original patch when adding the string to the set fails. Namely, it would return "t == NULL", which would be "t != s", and the intern-in-place would end up setting your pointer to NULL rather than doing nothing and clearing the error code.

So I guess some of it comes down to whether "loweis" would also reject this change on the basis that mathematically a "set is not a dict". His claim, though, was that "nobody else is speaking in favor of the patch", while at least Colin Winter has expressed some interest at this point.

John
=:->

calling dictresize outside dictobject.c
by Dan Schult 09 Apr '09

Hi,

I'm trying to write a C extension which is a subclass of dict. I want to do something like a setdefault() but with a single lookup.

Looking through the dictobject code, the three workhorse routines lookdict, insertdict and dictresize are not available directly for functions outside dictobject.c, but I can get at lookdict through dict->ma_lookup().

So I use lookdict to get the PyDictEntry (call it ep) I'm looking for. The comments for lookdict say ep is ready to be set... so I do that. Then I check whether the dict needs to be resized, following the nice example of PyDict_SetItem. But I can't call dictresize to finish off the process.

Should I be using PyDict_SetItem directly? No... it does its own lookup. I don't want a second lookup! I already know which entry will be filled.

So then I look at the code for setdefault and it also does a double lookup for checking and setting an entry.

What subtle issue am I missing? Why does setdefault do a double lookup? More globally, why isn't dictresize available through the C-API?

If there isn't a reason to do a double lookup I have a patch for setdefault, but I thought I should ask here first.

Thanks!
Dan
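The cost being asked about can be observed from pure Python by counting how often a key is hashed, which serves here as a rough proxy for the number of table lookups. This is a sketch under that assumption; `CountingKey` and `get_then_set` are hypothetical helpers, not part of any C-API. The count for `setdefault()` depends on the interpreter version, so it is printed rather than stated.

```python
class CountingKey:
    """Hashable key that records how many times it has been hashed."""

    def __init__(self, name):
        self.name = name
        self.hash_calls = 0

    def __hash__(self):
        self.hash_calls += 1
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, CountingKey) and self.name == other.name


def get_then_set(d, key, default):
    """Emulate setdefault() with a separate check and store."""
    value = d.get(key)        # first lookup
    if value is None:
        d[key] = default      # second lookup on a miss
        value = default
    return value


key_a = CountingKey("a")
get_then_set({}, key_a, 1)
two_step_hashes = key_a.hash_calls

key_b = CountingKey("b")
{}.setdefault(key_b, 1)
setdefault_hashes = key_b.hash_calls

print("get + set hashed the key", two_step_hashes, "times")
print("setdefault hashed the key", setdefault_hashes, "time(s)")
```

On a miss the two-step version hashes the key twice, once for the check and once for the store, which is the double work the message wants to avoid.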

py3k build erroring out on fileio?
by Jeroen Ruigrok van der Werven 09 Apr '09

Just to make sure I am not doing something silly, with a configure line as such:

  ./configure --prefix=/home/asmodai/local --with-wide-unicode --with-pymalloc --with-threads --with-computed-gotos

would there be any reason why I am getting the following error with both BSD make and gmake:

  make: don't know how to make ./Modules/_fileio.c. Stop

[Will log an issue if it turns out to, indeed, be a problem with the tree and not me.]

--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーンラウフロックヴァンデルウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Forgive us our trespasses, as we forgive those that trespass against us...

Re: [Python-Dev] Mercurial?
by Dirkjan Ochtman 09 Apr '09

(going back on-list)

On 05/04/2009 15:42, Alexandre Vassalotti wrote:
>> I'm pretty sure that we'll need to reconvert; I don't think the current
>> conversion is particularly good.
>
> What is bad about it?

For one thing, it has the [svn] prefixes, which I found to be quite ugly. hgsubversion in many cases will preserve the rev order from svn so that the local revision numbers that hg shows will be the same as in SVN anyway. On top of that, good conversion tools save the svn revision in the revision metadata in hg, so that you can see it with log --debug.

For another, I'd like to use an author map to bring the revision authors more in line with what Mercurial repositories usually display; this helps with tool support and is also just a nicer solution IMO. I have a stab at an author map at http://dirkjan.ochtman.nl/author-map. Could use some review, but it seems like a good start.

> I largely prefer clone to named branches. From personal experience, I
> found named branches difficult to use properly. And, I think even
> Mercurial developers don't use them.

No, the Mercurial project currently doesn't use them. Mozilla does use them at the moment, because they found they did have some advantages (especially lower disk usage because no separate clones were needed). I think named branches are fine for long-lived branches.

At the very least we should have a proper discussion over this.

> How do you reorder the revlog of a repository?

There are scripts for this which can be investigated.

> I am in favor of pruning the old branches, but not of leaving the old
> history behind. The current Mercurial mirror of py3k is 92M on my disk
> which is totally reasonable. So, I don't see what would be the
> advantage there.

The current Mercurial mirror for py3k also doesn't include any history from before it was branched, which is bad, IMO. In order to get the most out of the DVCS structure, it would be helpful if py3k shared history with the normal (trunk) branches.

> I was thinking of something very basic, e.g., something like a commit
> hook that would asynchronously commit the latest revision to svn. We
> wouldn't need to convert much meta-data; just the committer's name and
> the changelog would be fine.

What's the use case, who do you want to support with this? hgweb trivially provides tarballs for download on every revision, so people who don't want to use hg can easily download a snapshot.

> Not really. Currently, core developers can only push stuff using the
> Bazaar setup. Personally, I think SSH access would be a lot nicer, but
> this will depend how confident python.org's admins are with this idea.

We could still enable pushing through http(s) for hgweb(dir).

Cheers,
Dirkjan

Re: [Python-Dev] decorator module in stdlib?
by P.J. Eby 08 Apr '09

At 10:51 AM 4/8/2009 -0700, Guido van Rossum wrote:
>I would like it even less if an API cared about the
>*actual* signature of a function I pass into it.

One notable use of callable argument inspection is Bobo, the 12-years-ago predecessor to Zope, which used argument information to determine form or query string parameter names. (Were Bobo being written for the first time today for Python 3, I imagine it would use argument annotations to specify types, instead of requiring them to be in the client-side field names.)

Bobo, of course, is just a single case of the general pattern of tools that expose a callable to some other (possibly explicitly-typed) system. E.g., wrapping Python functions for exposure to C, Java, .NET, CORBA, SOAP, etc.

Anyway, it's nice for decorators to be transparent to inspection when the decorator doesn't actually modify the calling signature, so that you can then use your decorated functions with tools like the above.
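A rough sketch of the pattern described here, where a tool reads a callable's parameter names to decide which form fields to pass in. This is not Bobo's code; `call_with_form` and `greet` are hypothetical, and `inspect.signature` is the modern standard-library API rather than anything available in 2009.

```python
import inspect


def call_with_form(func, form):
    """Call func with arguments picked out of `form` by parameter name."""
    names = inspect.signature(func).parameters
    kwargs = {name: form[name] for name in names if name in form}
    return func(**kwargs)


def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"


# Fields the function does not declare are simply ignored.
result = call_with_form(greet, {"name": "Guido", "unused": "x"})
print(result)  # Hello, Guido!
```

A tool like this only works on a decorated function if the decorator keeps the original parameter names visible; a wrapper that reports just `*args, **kwargs` would give it no field names to match.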

slightly inconsistent set/list pop behaviour
by Tennessee Leeuwenburg 08 Apr '09

Now, I know that sets aren't ordered, but...

foo = set([1,2,3,4,5])
bar = [1,2,3,4,5]

foo.pop() will reliably return 1
while bar.pop() will return 5

discuss :)

Cheers,
-T
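The observation can be reproduced with a few lines. The only guarantee relied on below is that `list.pop()` removes the last item; `set.pop()` is documented as removing an arbitrary element, so whichever value it returns on a given interpreter is an implementation detail.

```python
foo = set([1, 2, 3, 4, 5])
bar = [1, 2, 3, 4, 5]

popped_from_list = bar.pop()   # always the last element of the list
popped_from_set = foo.pop()    # an arbitrary element of the set

print("list.pop() ->", popped_from_list)
print("set.pop()  ->", popped_from_set)
```

Code that needs a particular element from a set should pick it explicitly, for example with `min(foo)`, rather than depend on what `pop()` happens to return.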

Update PEP 374 (DVCS)
by Aahz 08 Apr '09

Someone listed this URL on c.l.py and I thought it would make a good reference addition to PEP 374 (DVCS decision):

http://www.catb.org/~esr/writings/version-control/version-control.html

--
Aahz (aahz(a)pythoncraft.com) <*> http://www.pythoncraft.com/

"...string iteration isn't about treating strings as sequences of strings, it's about treating strings as sequences of characters. The fact that characters are also strings is the reason we have problems, but characters are strings for other good reasons." --Aahz
