[Python-Dev] Re: [Python-checkins]
Wed, 21 Aug 2002 23:39:34 -0400
>> Do you expect that to be an issue?  When I built a database from 20,000
>> messages, the whole thing fit in a Python dict consuming about 10MB.
[Eric S. Raymond]
> Hm, that's a bit smaller than I would have thought, but the order of
> magnitude I was expecting.
It's even smaller than that <wink>.  The dict here maps strings to instances
of a Python class (WordInfo).  The latter uses new-in-2.2 __slots__, and
those give major memory efficiencies over old-style classes, but there's
still substantial memory overhead compared to what's possible in C.  In
addition, there are memory overheads for the Python objects stored in the
WordInfo instances, including a Python float object in each record recording
the time.time() of last access by the scoring method.
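To make the record shape concrete, here's a minimal sketch of a
WordInfo-style class.  The atime field and the "best 15" count are described
in this message; the other field names are illustrative guesses, not the
actual code:

    import time

    class WordInfo(object):
        # __slots__ suppresses the per-instance __dict__, so each record
        # stores only these fields -- the new-in-2.2 memory win noted above.
        __slots__ = ('atime', 'spamcount', 'hamcount', 'killcount')

        def __init__(self, atime):
            self.atime = atime   # last access by scoring, a time.time() value
            self.spamcount = 0   # illustrative: occurrences in spam
            self.hamcount = 0    # illustrative: occurrences in ham
            self.killcount = 0   # illustrative: times in the "best 15" list

    wordinfo = {}                # the dict: token string -> WordInfo
    wordinfo['free'] = WordInfo(time.time())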
IOW, there are tons of memory overheads here, yet the size was still small.
So I have no hesitation leaving this part in Python, and coding this part
was a trivial finger exercise.  You know all that, though!  It makes your
decision to use C from the start hard to fathom.
> Recognition features should age!  Wow!  That's a good point!  With the
> age counter being reset when they're recognized.
For concreteness, here's the comment from the Python code, which I believe
covers this:
    # (*)atime is the last access time, a UTC time.time() value.  It's the
    # most recent time this word was used by scoring (i.e., by spamprob(),
    # not by training via learn()); or, if the word has never been used by
    # scoring, the time the word record was created (i.e., by learn()).
    # One good criterion for identifying junk (word records that have no
    # value) is to delete words that haven't been used for a long time.
    # Perhaps they were typos, or unique identifiers, or relevant to a
    # once-hot topic or scam that's fallen out of favor.  Whatever, if
    # a word is no longer being used, it's just wasting space.
Besides the space-saving gimmick, there may be practical value in expiring
older words that are getting used, but less frequently over time.  That
would be evidence that the nature of the world is changing, and more
aggressively expiring the model for how the world *used* to be may speed
adaptation to the new realities.  I'm not saving enough info to do that,
though, and it's unclear whether it would really help.  Against it, while we
see new spam gimmicks pop up regularly, the old ones never seem to go away
(e.g., I don't do anything to try to block spam on my email accounts, and
the bulk of the spam I get is still easily recognized from the subject line
alone).  However, because it's all written in Python <wink>, it will be
easy to set up experiments to answer such questions.
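For illustration, here's one way the expiration criterion in that comment
might look; the function and the threshold are made up for this sketch, not
taken from the actual code:

    import time

    def expire_old_words(wordinfo, max_age_days=120):
        # Drop records whose atime shows no use by scoring for a long
        # time; 120 days is an arbitrary threshold for the sketch.
        cutoff = time.time() - max_age_days * 24 * 3600
        for word in list(wordinfo):
            if wordinfo[word].atime < cutoff:
                del wordinfo[word]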
BTW, the ifile FAQ gives a little info about the expiration scheme ifile
uses.  Rennie's paper gives more:
    Age is defined as the number of e-mail messages which have been
    added to the model since frequency statistics have been kept for
    the word.  Old, infrequent words are to be dropped while young words
    and old, frequent words should be kept.  One way to quantify this
    is to say that words which occur fewer than log2(age)-1 times
    should be discarded from the model.  For example, if "baseball"
    occurred in the 1st document and occurred 5 or fewer times in the
    next 63 documents, the word and its corresponding statistics would
    be eliminated from the model's database.  This feature selection
    cutoff is used in ifile and is found to significantly improve
    efficiency without noticeably affecting classification performance.
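As a quick sketch of that cutoff (the formula is from the quote; the
function name is mine):

    import math

    def should_discard(count, age):
        # Rennie's cutoff: drop words occurring fewer than log2(age) - 1
        # times, where age is the number of messages added to the model
        # since statistics were first kept for the word.
        return count < math.log(age, 2) - 1

    # At age 64, log2(64) - 1 == 5, so words seen fewer than 5 times go.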
I'm not sure how that policy would work with Graham's scheme (which has
differences from the more conventional scheme ifile uses).  Our Python code
also saves a count of the number of times each word makes it into Graham's
"best 15" list, and I expect that to be a direct measure of the value we're
getting out of keeping a word ("word" means whatever the tokenizer passes to
the classifier -- it's really any hashable and (for now) pickleable Python
object).
> I thought part of the point of the method was that you get
> sorting for free because of the way elements are inserted.
Sure, if range-search or final sorted order is important, it's a great
benefit.  I was only wondering about why you'd expect spatial locality in
the input as naturally ordered.
> No, but think about how the pointer in a binary search moves.  It's
> spatially bursty.  Memory access frequencies for repeated binary
> searches will be a sum of bursty signals, analogous to the way
> network traffic volumes look in the time domain.  In fact the
> graph of memory address vs. number of accesses is gonna wind up
> looking an awful lot like 1/f noise, I think.  *Not* evenly
> distributed; something there for LRU to work with.
Judy may or may not be able to exploit something here; I don't know, and I'd
need to know a lot more about Judy's implementation to even start to guess.
Plain binary search has horrid cache behavior, though.  Indeed, most of the
research papers on B-Trees suggest that binary search doesn't buy anything
in even rather large B-Tree nodes, because the cache misses swamp the
reduced instruction count over what a simple linear search does.  Worse is
that linear search can be significantly faster if the HW is smart enough to
detect the regular address access pattern of linear search and do some
helpful prefetching for you.  More recent research on B-Trees is on ways to
get away from bad binary search behavior; two current lines are using
explicit prefetch instructions to minimize the stalls, and using a more
cache-friendly data structure inside a B-Tree node.  Out-guessing modern
implementations is damned hard.
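To make the linear-vs-binary point concrete, here's a toy sketch of
searching inside a single sorted B-Tree node (Python hides cache effects,
so this only illustrates the two access patterns):

    import bisect

    def find_in_node_linear(keys, key):
        # Sequential scan: the address pattern is regular, which HW
        # prefetchers can exploit in a C implementation.
        for i, k in enumerate(keys):
            if k >= key:
                return i
        return len(keys)

    def find_in_node_binary(keys, key):
        # Binary probing jumps around the node, so each probe is a
        # likely cache miss until the search narrows down.
        return bisect.bisect_left(keys, key)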
With a Python dict you're likely to get a cache miss per lookup.  If that's
a disaster, time.clock() isn't revealing it <wink>.
> What I'm starting to test now is a refactoring of the program where it'll
> spawn a daemon version of itself first time it's called.  The daemon
> eats the wordlists and stays in core fielding requests from subsequent
> program runs.  Basically an answer to the "how do you call bogofilter 1K
> times a day from procmail without bringing your disks to their knees"
> problem -- persistence on the cheap.
> Thing is that the solution to this problem is very generic. Might
> turn into a Python framework.
Sounds good! Just don't call it "compilerlike" <wink>.
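FWIW, here's a bare-bones sketch of that daemon idea, assuming a Unix-domain
socket; the names and protocol are made up, and the scoring stub stands in
for the real classifier:

    import os, socket

    SOCK = '/tmp/classifierd.sock'          # made-up rendezvous path

    def score_message(msg, wordinfo):
        return 0.5                          # stub for the real scorer

    def serve(wordinfo):
        # Long-lived process: eat the wordlists once, field requests forever.
        if os.path.exists(SOCK):
            os.unlink(SOCK)
        srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        srv.bind(SOCK)
        srv.listen(5)
        while True:
            conn, _ = srv.accept()
            msg = conn.recv(1 << 20)        # one message per connection
            conn.sendall(b'%.6f' % score_message(msg, wordinfo))
            conn.close()

    def classify(msg):
        # Cheap per-message client, e.g. invoked from procmail.
        c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        c.connect(SOCK)
        c.sendall(msg)
        try:
            return float(c.recv(64))
        finally:
            c.close()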