Bayesian filters are pretty robust in the face of corpus
contamination, provided you have a threshold for the number of
occurrences of a word that you'll consider. If you don't
do that, then yes, a single legit email in your spam
corpus could cause your filters to reject every similar email.
A single email could easily contain five to eight words
that never occur in any other email. (Username, domain
name, server name, street address, etc.) If this got
into your spam corpus by mistake, then every succeeding
email from the same person would be classified as spam.
What this means is that you may want to use slightly
different thresholds for occurrences depending on how
much you trust the (human) classifier. For an app to be
used by end users, you might want to have a high threshold,
like 20 occurrences.
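A minimal sketch of such an occurrence threshold (all names here are hypothetical, not from any real filter; too-rare tokens are treated as neutral so a single mislabeled email can't poison the corpus):

```python
# Sketch of an occurrence threshold for a Bayesian spam filter.
MIN_OCCURRENCES = 5

def token_probability(token, spam_counts, ham_counts, nspam, nham):
    """Spam probability for one token; neutral (0.5) if the token
    has been seen fewer than MIN_OCCURRENCES times overall."""
    s = spam_counts.get(token, 0)
    h = ham_counts.get(token, 0)
    if s + h < MIN_OCCURRENCES:
        return 0.5  # too rare to trust; ignore for classification
    spam_freq = s / float(max(nspam, 1))
    ham_freq = h / float(max(nham, 1))
    return spam_freq / (spam_freq + ham_freq)
```

With a threshold of 20 instead of 5, those five-to-eight unique words from a single misfiled message would stay neutral for a long time.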
I find from my own experience that I often misclassify
mail. I seem to be more likely to put spam in a legit
mail folder than the reverse. But, as you guys found, the
first result of testing your filters tends to be cleaning up
exactly such mistakes.
Greg Ward wrote:
> On 27 August 2002, Tim Peters said:
> > Setting this up has been a bitch. All early attempts floundered because it
> > turned out there was *some* systematic difference between the ham and spam
> > archives that made the job trivial.
> > The ham archive: I selected 20,000 messages, and broke them into 5 sets of
> > 4,000 each, at random, from a python-list archive Barry put together,
> > containing msgs only after SpamAssassin was put into play on python.org.
> > It's hoped that's pretty clean, but nobody checked all ~= 160,000+ msgs. As
> > will be seen below, it's not clean enough.
> One of the other perennial-seeming topics on spamassassin-devel (a list
> that I follow only sporadically) is that careful manual cleaning of your
> corpus is *essential*. The concern of the main SA developers is that
> spam in your non-spam folder (and vice-versa) will prejudice the genetic
> algorithm that evolves SA's scores in the wrong direction. Gut instinct
> tells me the Bayesian approach ought to be more robust against this sort
> of thing, but even it must have a breaking point at which misclassified
> messages throw off the probabilities.
> But that's entirely consistent with your statement:
> > Another lesson reinforces
> > one from my previous life in speech recognition: rigorous data collection,
> cleaning, tagging and maintenance is crucial when working with statistical
> > approaches, and is damned expensive to do.
> On corpus collection...
> > The spam archive: This is essentially all of Bruce Guenter's 2002 spam
> > collection, at <http://www.em.ca/~bruceg/spam/>. It was broken at random
> > into 5 sets of 2,750 spams each.
> One possibility occurs to me: we could build our own corpus by
> collecting spam on python.org for a few weeks. Here's a rough breakdown
> of mail rejected by mail.python.org over the last 10 days,
> eyeball-estimated messages per day:
> bad RCPT                      150 - 300   [1]
> bad sender                     50 - 190   [2]
> relay denied                   20 - 180   [3]
> known spammer addr/domain      15 - 60
> 8-bit chars in subject        130 - 200
> 8-bit chars in header addrs    10 - 60
> banned charset in subject       5 - 50    [4]
> "ADV" in subject                0 - 5
> no Message-Id header          100 - 400   [5]
> invalid header address syntax   5 - 50    [6]
> no valid senders in header     10 - 15    [7]
> rejected by SpamAssassin       20 - 50    [8]
> quarantined by SpamAssassin     5 - 50    [8]
> [1] this includes mail accidentally sent to eg. giudo(a)python.org,
> but based on scanning the reject logs, I'd say the vast majority
> is spam. However, such messages are rejected after RCPT TO,
> so we never see the message itself. Most of the bad recipient
> addrs are either ancient (ipc6(a)python.org,
> grail-feedback(a)python.org) or fictitious (success(a)python.org,
> [2] sender verification failed, eg. someone tried to claim an
> envelope sender like foo(a)bogus.domain. Usually spam, but innocent
> bystanders can be hit by DNS servers suddenly exploding (hello,
> comcast.net). This only includes hard failures (DNS "no such
> domain"), not soft failures (DNS timeout).
> [3] I'd be leery of accepting mail that's trying to hijack
> mail.python.org as an open relay, even though that would
> be a goldmine of spam. (OTOH, we could reject after the
> DATA command, and save the message anyways.)
> [4] mail.python.org rejects any message with a properly MIME-encoded
> subject using any of the following charsets:
> big5, euc-kr, gb2312, ks_c_5601-1987
> [5] includes viruses as well as spam (and no doubt some innocent
> false positives, although I have added exemptions for the MUA/MTA
> combinations that most commonly result in legit mail reaching
> mail.python.org without a Message-Id header, eg. KMail/qmail)
> [6] eg. "To: all my friends" or "From: <>"
> [7] no valid sender address in any header line -- eg. someone gives a
> valid MAIL FROM address, but then puts "From: blah(a)bogus.domain"
> in the headers. Easily defeated with a "Sender" or "Reply-to"
> header.
> [8] any message scoring >= 10.0 is rejected at SMTP time; any
> message scoring >= 5.0 but < 10 is saved in /var/mail/spam
> for later review
> Executive summary:
> * it's a good thing we do all those easy checks before involving
> SA, or the load on the server would be a lot higher
> * give me 10 days of spam-harvesting, and I can equal Bruce
> Guenter's spam archive for 2002. (Of course, it'll take a couple
> of days to set the mail server up for the harvesting, and a couple
> more days to clean through the ~2000 caught messages, but you get
> the idea.)
> > + Mailman added distinctive headers to every message in the ham
> > archive, which appear nowhere in the spam archive. A Bayesian
> > classifier picks up on that immediately.
> > + Mailman also adds "[name-of-list]" to every Subject line.
> Perhaps that spam-harvesting run should also set aside a random
> selection of apparently-non-spam messages received at the same time.
> Then you'd have a corpus of mail sent to the same server, more-or-less
> to the same addresses, over the same period of time.
> Oh, any custom corpus should also include the ~300 false positives and
> ~600 false negatives gathered since SA started running on
> mail.python.org in April.
Well, if SpamAssassin wasn't so stupid, I suppose you could have read this
already:
From: python-dev-admin(a)python.org
Sent: Tuesday, August 27, 2002 10:38 PM
Subject: Your message to Python-Dev awaits moderator approval
Your mail to 'Python-Dev' with the subject
The first trustworthy <wink> GBayes results
is being held until the list moderator can review it for approval.
The reason it is being held:
Message has a suspicious header
Either the message will get posted to the list, or you will receive
notification of the moderator's decision.
[Following up to a message that went to the checkins list.]
> Note change in behavior from 1.5.2. The new argument to
> NameError is an error message and not just the missing name.
Skip Montanaro writes:
> It seems to me that somewhere in the docs it would be worthwhile to state:
> Messages to exceptions are not part of the Python API. Their contents
> may change from one version of Python to the next without warning and
> should not be relied on for code which will be run with multiple
> versions of the interpreter.
The catch, of course, is that it's not clear (perhaps only to me?)
that what changed was a message. I'd interpret the original behavior
(if documented, which I won't bother to check) as an API requirement.
AttributeError used to have a similar behavior; I don't know how
rigorously that's been maintained either.
In either case, I think the ideal solution to the problem of figuring
out what went wrong, from within the executing program, is for these
errors to have an attribute that identifies the missing name ("name"
would be a good name for it). KeyError could similarly have an
attribute "key". To deal with existing code, the attributes would not
be set. Additional C functions could be provided for use in code that
is modified to provide the information.
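A sketch of what that could look like from the Python side (the subclass and helper names here are hypothetical; this only illustrates carrying the missing name as an attribute rather than in the message text):

```python
# Hypothetical sketch: a NameError variant that exposes the missing
# name as a "name" attribute, so callers need not parse the message.
class NameErrorWithName(NameError):
    def __init__(self, name):
        NameError.__init__(self, "name '%s' is not defined" % name)
        self.name = name  # programmatic access to the missing name

def lookup(namespace, name):
    """Raise the attribute-bearing error instead of a bare NameError."""
    try:
        return namespace[name]
    except KeyError:
        raise NameErrorWithName(name)
```

A KeyError could carry a "key" attribute in exactly the same fashion.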
Fred L. Drake, Jr. <fdrake at acm.org>
PythonLabs at Zope Corporation
Has somebody already tested Python 2.2.1 using valgrind 1.0.1?
Because when testing one of my own modules, I get the following
recurrent error report whenever I import something:
==19951== Conditional jump or move depends on uninitialised value(s)
==19951== at 0x8094B85: find_module (in
==19951== by 0x8095DE2: import_submodule (Python/import.c:1887)
==19951== by 0x80959B8: load_next (Python/import.c:1752)
==19951== by 0x8097608: import_module_ex (Python/import.c:1603)
and I don't know if I can ignore it or if this is a real Python error.
PS: the error appears whenever you import a module, so,
after installing valgrind, try doing:
echo import math > test.py
valgrind /usr/local/bin/python test.py
if you have python installed in the /usr/local/bin directory, of course.
I think the patch associated with this thread has an unintended side effect.
Zack pointed out three flaws in the original code:
Third, if an error other than the expected one comes back, the
loop clobbers the saved exception info and keeps going. Consider
the situation where PATH=/bin:/usr/bin, /bin/foobar exists but is
not executable by the invoking user, and /usr/bin/foobar does not
exist. The exception thrown will be 'No such file or directory',
not the expected 'Permission denied'.
The patch, as I understand it, changes the behaviour so as to raise
the exception "Permission denied" in this case.
Consider a similar situation in which both /bin/foobar (not executable
by the user) and /usr/bin/foobar (executable by the user) exist.
Given the command "foobar", the shell will execute /usr/bin/foobar.
If I understand the patch correctly, python will give up when it
encounters /bin/foobar and raise the "Permission denied" exception.
I believe this just happened to me today. I had a shell script named
"gcc" in ~/bin (first on my path) some months back. When I was
finished with it, I just did "chmod -x ~/bin/gcc" and forgot about it.
Today was the first time since this patch went in that I ran gcc via
python (using scipy's weave). Boy was I surprised at the message
"unable to execute gcc: Permission denied"!
I guess the fix is to save the EACCES ("Permission denied") exception
and keep going in case there is an executable later in the path.
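That fix could be sketched like this (a hypothetical stand-in for the exec-over-PATH loop, not the actual patch):

```python
import errno
import os

def find_executable(name, path_dirs):
    """Try each PATH entry in order; remember a 'Permission denied'
    failure but keep searching, so a later executable entry wins."""
    saved_exc = None
    for d in path_dirs:
        candidate = os.path.join(d, name)
        if not os.path.exists(candidate):
            continue  # "No such file": just try the next directory
        if os.access(candidate, os.X_OK):
            return candidate  # an executable later in the path wins
        # not executable: remember the failure, but keep going
        saved_exc = OSError(errno.EACCES, "Permission denied", candidate)
    if saved_exc is not None:
        raise saved_exc  # only non-executable matches were found
    raise OSError(errno.ENOENT, "No such file or directory", name)
```

With this logic, the stale non-executable ~/bin/gcc would lose to the real /usr/bin/gcc instead of aborting the search.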
> [rexec compromised by deleting __builtins__]
> This has been known for a while, see python.org/sf/577530.
> My recommendation is the same as always: don't trust rexec.
> --Guido van Rossum (home page: http://www.python.org/~guido/)
I think it is a VERY BAD idea to advertise publicly that rexec can be
used to "safely" restrict execution while acknowledging only privately
(ie, in the above postings to a developers-only list and on
sourceforge) that it cannot.
Therefore I propose that the rexec documentation in the official
Python Library Reference be modified to add a note saying that rexec
is not completely reliable and can be undermined by a knowledgeable
hacker. The current documentation STRONGLY implies this is NOT the
case by explaining in detail only the more minor susceptibilities: DoS
attacks (memory or CPU time) and raising SystemExit.
Why not add something like the following to the beginning of the module
documentation?
Warning: While the rexec module is designed to perform as described
below, it does have a few known vulnerabilities which could be exploited
by carefully written code. Thus it should not be relied upon in
situations requiring "production ready" security. In such situations,
execution via sub-processes (a separate Python executable) or very
careful "cleansing" of data to be processed may be necessary.
Alternatively, help in patching known rexec vulnerabilities would be welcome.
Admitting to library weaknesses (especially in the area of security)
doesn't make great PR, but at least it's honest!
-- Michael Chermside
In the `heapq' module, I'm a little bothered by the fact that the
functions have `heap' as a prefix in their names. If they had been
installed as standard list methods, the prefix would be quite
understandable, but they have not been.
The most usual way of using a module is:
from MODULE import METHOD
and we should name METHODs accordingly, not repeating the MODULE as prefix.
This is a rather common usage, almost everywhere in the Python library.
So my suggestion of changing now, before `heapq' gets released for real:
heappush -> push
heappop -> pop
heapreplace -> replace
I guess that `heapify' is OK as it stands.
The example should be changed accordingly, that is, using `import heapq'
instead of `from heapq import such-and-such', using `heapq.push' instead of
`heappush' and `heapq.pop' instead of `heappop'.
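For reference, here is how the two styles compare with the existing stdlib names (only the real `heapq' functions are used; the renamed forms do not exist):

```python
import heapq

h = []
for item in (3, 1, 2):
    heapq.heappush(h, item)   # the prefixed name under discussion
smallest = heapq.heappop(h)   # a heap always pops its smallest item

data = [5, 4, 3]
heapq.heapify(data)           # in-place transform; this name would stay
```

Under the proposal, the qualified calls would instead read `heapq.push(h, item)' and `heapq.pop(h)'.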
François Pinard http://www.iro.umontreal.ca/~pinard
While looking for efficient ways to manipulate large files (>2 GB)
with Python, I noted an artificial limitation in the mmap module in
the standard library. Right now mmap objects behave like a hybrid
between a file and a string, but their size is limited to 2 GB files
on 32-bit architectures (the offset argument in the mmap call is
always set to 0 and several members of the structure have type size_t).
Adding a rough implementation for a 64-bit offset in the mmap call is
trivial (I have done it, cutting and pasting from fileobject.c), but
it is not obvious how the file-like soul of the mmap object should be
affected by the offset. Actually, it is not clear to me why the
file-like behavior is present at all.
Is there any plan to add LFS support to the mmap module?
Are there known workarounds?
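As an illustration of what an offset-aware interface looks like, here is the `offset' argument as later added to mmap.mmap (the offset must be a multiple of mmap.ALLOCATIONGRANULARITY); this is a sketch of the interface, not the patch discussed above:

```python
import mmap
import os
import tempfile

# Map only a window of a file, starting at a page-aligned offset,
# instead of mapping the whole file from position 0.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * mmap.ALLOCATIONGRANULARITY + b"hello")
    m = mmap.mmap(fd, 5, offset=mmap.ALLOCATIONGRANULARITY)
    window = m[:5]  # the bytes at file offset ALLOCATIONGRANULARITY
    m.close()
finally:
    os.close(fd)
    os.remove(path)
```

With a 64-bit offset type, such windows could address regions past the 2 GB mark even on 32-bit architectures.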