I've been doing some memory profiling of my application, and I've found
some interesting results with how intern() works. I was pretty surprised
to see that the "interned" dict was actually consuming a significant
amount of total memory.
To give the specific values, after doing:
bzr branch A B
of a small project, the total memory consumption is ~21MB
Of that, the largest single object is the 'interned' dict, at 1.57MB,
which contains 22k strings. One interesting bit: the size of it + the
referenced strings is only 2.4MB, so the "interned" dict *by itself* is
2/3rds the size of the dict + strings it contains.
It also means that the average size of a referenced string is 37.4
bytes. A 'str' has 24 bytes of overhead, so the average string is 13.5
characters long. So to save references to 13.5*22k ~ 300kB of character
data, we are paying 2.4MB, or about 8:1 overhead.
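For anyone checking my math, here is a quick back-of-the-envelope in
Python, using the rounded figures above (the small differences from
37.4/13.5 are just rounding):

    dict_size = 1.57e6        # the 'interned' dict itself, in bytes
    total_size = 2.4e6        # dict + the strings it references
    count = 22000             # number of interned strings

    avg_str = (total_size - dict_size) / count  # ~37.7 bytes per str object
    avg_chars = avg_str - 24                    # ~13.7 chars after overhead
    char_data = avg_chars * count               # ~300kB of character data
    print(total_size / char_data)               # ~8.0, i.e. the 8:1 overhead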
When I looked at the actual references from interned, I saw mostly
variable names, which makes sense considering that every variable name
goes through the Python intern dict. And when you look at the intern
function, it doesn't use setdefault logic; it actually does a get()
followed by a set(), which means the cost of interning is 1-2 lookups,
depending on whether the string is already present.
(I also saw a whole lot of strings for the error codes in win32all /
winerror.py, and Windows error code names tend to be longer than
average.)
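To illustrate, here is roughly what the current get()-then-set() logic
amounts to, as a pure-Python sketch (the real code is C; 'interned'
here stands in for the global intern dict):

    def intern_sketch(interned, s):
        t = interned.get(s)   # lookup #1
        if t is not None:
            return t
        interned[s] = s       # lookup #2, storing s as both key and value
        return s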
Anyway, I think the internals of intern() could be done a bit better.
Here are some concrete ideas (a rough sketch follows the list):
a) Don't keep a double reference to both key and value to the same
object (1 pointer per entry), this could be as simple as using a
Set() instead of a dict()
b) Don't cache the hash key in the set, as strings already cache them.
(1 long per entry). This is a big win for space, but would need to
be balanced against lookup and collision resolving speed.
My guess is that reducing the size of the set will actually improve
speed more, because more items can fit in cache. It depends on how
many times you need to resolve a collision. If the string hash is
sufficiently spread out, and the load factor is reasonable, then
likely when you actually find an item in the set, it will be the
item you want, and you'll need to bring the string object into
cache anyway, so that you can do a string comparison (rather than
just a hash comparison.)
c) Use the existing lookup function one time (PySet->lookup()).
   Sets already have a "lookup" which is optimized for strings, and
   returns a pointer to where the object would go if it existed. This
   means the intern() function can do a single lookup, resolving any
   collisions, and either return the existing object or insert the new
   one without doing a second lookup.
d) Having a special structure might also allow for separate optimizing
of things like 'default size', 'grow rate', 'load factor', etc. A
lot of this could be tuned specifically knowing that we really only
have 1 of these objects, and it is going to be pointing at a lot of
strings that are < 50 bytes long.
If hashes of variable name strings are well distributed, we could
probably get away with a load factor of 2. If we know we are likely
to have lots and lots that never go away (you rarely *unload*
modules, and all variable names are in the intern dict), that would
suggest having a large initial size, and probably a wide growth
factor to avoid spending a lot of time resizing the set.
e) How tuned is String.hash() for the fact that most of these strings
   are going to be ASCII text? (I know that Python wants to support
   non-ASCII variable names, but I still think there is going to be an
   overwhelming bias towards characters in the range 65-122 ('A'-'z').)
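To make (a)-(d) concrete, here is a rough pure-Python sketch of the
kind of structure I have in mind. The name, the initial size, the load
factor, and the growth factor are all made up for illustration; the
real thing would be C, probing the hash already cached on the string
object:

    class StringSet(object):
        def __init__(self, initial_size=4096):    # (d) large initial size
            self._table = [None] * initial_size   # power of two, for masking
            self._used = 0

        def intern(self, s):
            table = self._table
            mask = len(table) - 1
            i = hash(s) & mask                    # (b) reuse the string's hash
            while True:
                entry = table[i]
                if entry is None:
                    table[i] = s                  # (c) insert at the probed
                    self._used += 1               #     slot, no second lookup
                    if self._used * 2 > len(table):   # (d) table >= 2x entries
                        self._resize()
                    return s
                if entry == s:                    # (a) one reference per entry
                    return entry
                i = (i + 1) & mask                # probe to resolve collisions

        def _resize(self):
            old = [s for s in self._table if s is not None]
            self._table = [None] * (len(self._table) * 4)  # (d) wide growth
            self._used = 0
            for s in old:
                self.intern(s)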
Also note that the overhead of the "interned" dict gets even worse on
64-bit platforms, where the size of a 'dictentry' doubles but the
average length of a variable name doesn't change.
Anyway, I would be happy to implement something along the lines of a
"StringSet", or maybe the "InternSet", etc. I just wanted to check if
people would be interested or not.
PS> I'm not yet subscribed to python-dev, so if you could make sure to
CC me in replies, I would appreciate it.
The decorator module written by Michele Simionato is a very useful
tool for preserving function signatures while applying a decorator.
Many different projects implement their own versions of the same
functionality; for example, TurboGears has its own utility for this,
and I guess others do something similar too.
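For anyone who hasn't run into the problem: functools.wraps copies
__name__ and __doc__, but it does not preserve the signature, which is
exactly the gap the decorator module fills. A quick demonstration (the
logged decorator is just a made-up example):

    import inspect
    from functools import wraps

    def logged(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            print("calling %s" % func.__name__)
            return func(*args, **kwargs)
        return wrapper

    @logged
    def handler(name, age=0):
        return name, age

    print(handler.__name__)             # 'handler' -- the name survives
    print(inspect.getargspec(handler))  # ([], 'args', 'kwargs', None) --
                                        # the (name, age=0) signature is lost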
Was the issue of including this module in the stdlib ever raised? If
yes, what were the arguments against it? If not, what do you folks
think, shouldn't it be included? I certainly think it should be.
Originally I sent this message to c.l.p, and Michele suggested it be
brought up on python-dev. He also pointed out that a PEP on this topic
has already been written and is in draft form.
What do you guys think, wouldn't this be a useful addition to functools?
Psss, psss, put it down! - http://www.cafepress.com/putitdown
Alexander Belopolsky wrote:
> On Thu, Apr 9, 2009 at 11:02 AM, John Arbash Meinel
> <john(a)arbash-meinel.com> wrote:
>> a) Don't keep a double reference to both key and value to the same
>> object (1 pointer per entry), this could be as simple as using a
>> Set() instead of a dict()
> There is a rejected patch implementing just that:
> http://bugs.python.org/issue1507011 .
Thanks for the heads up.
So reading that thread, the final reason it was rejected was two-part.
First:
> Without reviewing the patch again, I also doubt it is capable of
> getting rid of the reference count cheating: essentially, this
> cheating enables the interning dictionary to have weak references to
> strings, this is important to allow automatic collection of certain
> interned strings. This feature needs to be preserved, so the cheating
> in the reference count must continue.
That specific argument was invalid, because the patch just changed the
refcount trickery to use +/- 1. And I'm pretty sure Alexander's argument
was just that +/- 2 was weird, not that the "weakref" behavior was bad.
The other argument against the patch was based on the idea that:
The operation "give me the member equal but not identical to E" is
conceptually a lookup operation; the mathematical set construct has no
such operation, and the Python set models it closely. IOW, set is
*not* a dict with key==value.
I don't know if there was any consensus reached on this, since only
Martin responded this way.
I can say that for my "do some work with a medium-size code base" test,
the overhead of "interned" as a dictionary was 1.5MB out of 20MB total
memory. Simply changing it to a set would drop this to 1.0MB, and
changing it to a StringSet could further drop it to 0.5MB. (On 32-bit,
a dictentry is {hash, key, value} = 12 bytes per slot, a setentry
{hash, key} is 8, and a bare pointer is 4; hence the 1.5 : 1.0 : 0.5
ratio.) I have no proof about the impact on performance, since I
haven't benchmarked it yet; my guess is that any impact would depend on
whether the total size of 'interned' fits inside L2 cache or not.
There is a small bug in the original patch for the case where adding
the string to the set fails: it would return with "t == NULL", which
means "t != s", and the intern-in-place would end up setting your
pointer to NULL rather than doing nothing and clearing the error code.
So I guess some of it comes down to whether "loewis" would also reject
this change on the basis that mathematically a "set is not a dict".
Though his claim that "nobody else is speaking in favor of the patch"
no longer quite holds, given that at least Collin Winter has expressed
some interest in this.
I'm trying to write a C extension which is a subclass of dict.
I want to do something like a setdefault() but with a single lookup.
Looking through the dictobject code, the three workhorse
routines lookdict, insertdict and dictresize are not available
directly for functions outside dictobject.c,
but I can get at lookdict through dict->ma_lookup().
So I use lookdict to get the PyDictEntry (call it ep) I'm looking for.
The comments for lookdict say ep is ready to be set... so I do that.
Then I check whether the dict needs to be resized--following the
nice example of PyDict_SetItem. But I can't call dictresize to finish
off the process.
Should I be using PyDict_SetItem directly? No... it does its own
lookup, and I don't want a second lookup! I already know which entry
will be used.
So then I look at the code for setdefault and it also does
a double lookup for checking and setting an entry.
What subtle issue am I missing?
Why does setdefault do a double lookup?
More globally, why isn't dictresize available through the C-API?
If there isn't a reason for the double lookup, I have a patch for this,
but I thought I should ask here first.
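For reference, here is what setdefault effectively does today,
paraphrased as Python (this is my reading of dict_setdefault in
dictobject.c, not the real code):

    def setdefault_sketch(d, key, default=None):
        try:
            return d[key]       # lookup #1: probes via ma_lookup
        except KeyError:
            d[key] = default    # lookup #2: PyDict_SetItem probes again
            return default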
Just to make sure I am not doing something silly: with a configure line
such as ./configure --prefix=/home/asmodai/local --with-wide-unicode
--with-pymalloc --with-threads --with-computed-gotos, would there be any
reason why I am getting the following error with both BSD make and gmake:
make: don't know how to make ./Modules/_fileio.c. Stop
[Will log an issue if it turns out to, indeed, be a problem with the tree
and not me.]
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Forgive us our trespasses, as we forgive those that trespass against us...
(going back on-list)
On 05/04/2009 15:42, Alexandre Vassalotti wrote:
>> I'm pretty sure that we'll need to reconvert; I don't think the current
>> conversion is particularly good.
> What is bad about it?
For one thing, it has the [svn] prefixes, which I found to be quite
ugly. hgsubversion in many cases will preserve the rev order from svn so
that the local revision numbers that hg shows will be the same as in SVN
anyway. On top of that, good conversion tools save the svn revision in
the revision metadata in hg, so that you can see it with log --debug.
For another, I'd like to use an author map to bring the revision authors
more in line with what Mercurial repositories usually display; this
helps with tool support and is also just a nicer solution IMO.
I have a stab at an author map at http://dirkjan.ochtman.nl/author-map.
Could use some review, but it seems like a good start.
> I largely prefer clone to named branches. From personal experience, I
> found named branches difficult to use properly. And, I think even
> Mercurial developers don't use them.
No, the Mercurial project currently doesn't use them. Mozilla does use
them at the moment, because they found they did have some advantages
(especially lower disk usage because no separate clones were needed). I
think named branches are fine for long-lived branches.
At the very least we should have a proper discussion over this.
> How do you reorder the revlog of a repository?
There are scripts for this which can be investigated.
> I am in favor of pruning the old branches, but not of leaving the old
> history behind. The current Mercurial mirror of py3k is 92M on my disk
> which is totally reasonable. So, I don't see what would be the
> advantage there.
The current Mercurial mirror for py3k also doesn't include any history
from before it was branched, which is bad, IMO. In order to get the most
of the DVCS structure, it would be helpful if py3k shared history with
the normal (trunk) branches.
> I was thinking of something very basic, e.g. something like a commit
> hook that would asynchronously commit the latest revision to svn. We
> wouldn't need to convert much meta-data; just the committer's name and
> the changelog would be fine.
What's the use case, who do you want to support with this? hgweb
trivially provides tarballs for download on every revision, so people
who don't want to use hg can easily download a snapshot.
> Not really. Currently, core developers can only push stuff using the
> Bazaar setup. Personally, I think SSH access would be a lot nicer, but
> this will depend how confident python.org's admins are with this idea.
We could still enable pushing through http(s) for hgweb(dir).
At 10:51 AM 4/8/2009 -0700, Guido van Rossum wrote:
>I would like it even less if an API cared about the
>*actual* signature of a function I pass into it.
One notable use of callable argument inspection is Bobo, the
12-years-ago predecessor to Zope, which used argument information to
determine form or query string parameter names. (Were Bobo being
written for the first time today for Python 3, I imagine it would use
argument annotations to specify types, instead of requiring them to
be in the client-side field names.)
Bobo, of course, is just a single case of the general pattern of
tools that expose a callable to some other (possibly
explicitly-typed) system. E.g., wrapping Python functions for
exposure to C, Java, .NET, CORBA, SOAP, etc.
Anyway, it's nice for decorators to be transparent to inspection when
the decorator doesn't actually modify the calling signature, so that
you can then use your decorated functions with tools like the above.
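To make the pattern concrete, here is a minimal sketch of Bobo-style
argument mapping (all names are made up, and real systems also handle
defaults, type conversion, etc.):

    import inspect

    def call_with_form(func, form):
        # Pull values out of a form dict based on the callable's
        # positional argument names (no defaults handling here).
        argnames = inspect.getargspec(func)[0]
        return func(*[form[name] for name in argnames])

    def greet(greeting, name):
        return "%s, %s!" % (greeting, name)

    print(call_with_form(greet, {"name": "world", "greeting": "Hello"}))
    # -> Hello, world!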
Someone listed this URL on c.l.py and I thought it would make a good
reference addition to PEP 374 (DVCS decision):
Aahz (aahz(a)pythoncraft.com) <*> http://www.pythoncraft.com/
"...string iteration isn't about treating strings as sequences of strings,
it's about treating strings as sequences of characters. The fact that
characters are also strings is the reason we have problems, but characters
are strings for other good reasons." --Aahz
This issue has been largely resolved, but there is an outstanding bug where
the (reviewed and committed) solution does not work on certain versions of
FreeBSD (broken in 6.3, working in 7+). Do we have a list of 'supported
platforms', and is FreeBSD 6.3 in it?
What's the policy with regard to supporting dependencies like this?
Should I set this issue to 'pending', seeing as no one is currently
working on a patch for it? Or is leaving it open and hanging around
exactly the right thing to do?
"Don't believe everything you think"