[Tutor] Newbie in Python (fwd)
dyoo at hkn.eecs.berkeley.edu
Sun Mar 13 22:50:50 CET 2005
---------- Forwarded message ----------
Date: Sun, 13 Mar 2005 21:58:09 +1100
From: oscar ng <advancetravel at aapt.net.au>
To: 'Danny Yoo' <dyoo at hkn.eecs.berkeley.edu>
Subject: RE: [Tutor] Newbie in Python
Thanks for the reply. I wasn't sure how this works, so I am glad there is
someone who might be able to help me. Because this is a university
assignment I am not sure how much help you can provide, but here it is.
I need to build a mail filtering system that sorts mail messages
into appropriate categories, such as spam, job announcements and
conference announcements.
The assignment will be in two parts. In the first part you will try your
own approaches to solving the problem, using the Natural Language
Toolkit package for Python. In the second part, you will use the
techniques learned in the classes on text classification, and compare
the results of these and your own.
What Is Given
The target dataset consists of four types of documents, a list of spam
mail messages and a list of messages sent to various newsgroups. The
four types of documents are located in different directories. Each
document is formatted as an email message with the main text and two
email headers: From and Subject. All the HTML code has been removed.
Below is an example of a message from the corpus:
From: edward465tom at estpak.ee
Subject: YOUR APPLICATION HAS BEEN APPROVED
You Have Been APPROVED
for 3 UNSECURED VISA and MASTERCARDS!
Are you at least 18 Years of age?
Have a Valid Social Security No?
Income of at Least $99 p/week?
Our Banks offer:
INSTANT FREE ONLINE APPROVAL!
Receive your cards in as little as
TWO Weeks from Today!
Just in Time for Summer Vacation!
For more information on how you can get your Visa or
Mastercards NOW, click on the link below:
MailTo:creditcards4you at excite.com?Subject=creditcardinfo
If you are no longer interested in receiving information
on Credit Cards or Financial Services, please click on
the link below and you will be removed from our optin list.
MailTo:creditcardsusa at excite.com?Subject=optoutfinancial
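A minimal sketch of how a message in this format could be split into its two headers and its main text, using the standard library's email package. It assumes a blank line separates the headers from the body, as in ordinary email; the function name parse_message and the shortened sample text are made up for illustration.

```python
from email import message_from_string


def parse_message(raw_text):
    """Split a raw message into (sender, subject, body).

    Assumes the headers are followed by a blank line and then the body.
    """
    msg = message_from_string(raw_text)
    sender = msg.get("From", "")
    subject = msg.get("Subject", "")
    body = msg.get_payload()  # plain string for a non-multipart message
    return sender, subject, body


# Shortened version of the example message, with a blank line added
# between the headers and the body (an assumption about the corpus files).
sample = (
    "From: edward465tom at estpak.ee\n"
    "Subject: YOUR APPLICATION HAS BEEN APPROVED\n"
    "\n"
    "You Have Been APPROVED\n"
    "for 3 UNSECURED VISA and MASTERCARDS!\n"
)
sender, subject, body = parse_message(sample)
```

Keeping the headers apart from the body makes it possible to weight them differently later, for example by looking only at the Subject line.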
The four categories are as follows:
job announcements (now available)
conference announcements (now available)
What your code should do for Part I, then, is to tokenise the files,
classify the emails according to your own algorithm, and output the
results of the classification. Your algorithm might specify, for
example, that emails with greater than X% of capitalised words are spam.
Your algorithm for this part can be quite simple; the main aim is to get
the infrastructure built for Part II, and to get you thinking about what
is involved in these sorts of systems.
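The capitalised-words rule mentioned above could be sketched roughly as follows. This is only one possible reading: it treats "capitalised" as "written entirely in upper case", tokenises with a plain regular expression rather than the Natural Language Toolkit so that it runs on its own, and uses a threshold of 0.3 that is an arbitrary placeholder for X. The function names are hypothetical.

```python
import re


def tokenise(text):
    """Return the alphabetic words in text (a very crude tokeniser)."""
    return re.findall(r"[A-Za-z]+", text)


def capitalised_fraction(text):
    """Fraction of words written entirely in upper case.

    Single letters such as "I" or "A" are ignored so that ordinary
    sentences do not count as shouting.
    """
    words = tokenise(text)
    if not words:
        return 0.0
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    return len(caps) / len(words)


def classify(text, threshold=0.3):
    """Label text SPAM if more than `threshold` of its words are upper case.

    The threshold is a guess; it would need tuning on the real corpus.
    """
    return "SPAM" if capitalised_fraction(text) > threshold else "OTHER"
```

For instance, classify("INSTANT FREE ONLINE APPROVAL!") gives "SPAM", while an ordinary sentence with no upper-case words gives "OTHER". A single rule like this cannot separate job announcements from conference announcements, so further rules would be needed for those categories.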
The output of your code might look as follows:
24 messages are SPAM (77% correct): msg-a-2 msg-a-3 ...
11 messages are JOB ANN (63% correct): ja-4 ja-6 ...
35 messages are CONF ANN (84% correct): ca-1 ca-3 msg-a-11 ...
9 messages are OTHER (22% correct): 10000 10001 ...
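A rough sketch of how output in that shape could be produced. It assumes that "% correct" means the share of messages placed in a category that really belong there, and that the true category of each message is known (for example from the directory it came from). The function name report and the two dictionaries are hypothetical.

```python
from collections import defaultdict


def report(predictions, truth):
    """Build one summary line per predicted category.

    predictions and truth both map a message name to a category label.
    """
    by_category = defaultdict(list)
    for name, label in predictions.items():
        by_category[label].append(name)

    lines = []
    for label in sorted(by_category):
        names = sorted(by_category[label])
        correct = sum(1 for n in names if truth[n] == label)
        pct = round(100 * correct / len(names))
        lines.append(
            f"{len(names)} messages are {label} ({pct}% correct): "
            + " ".join(names)
        )
    return lines


# Tiny made-up example: one conference announcement is misfiled as spam.
predictions = {"msg-a-2": "SPAM", "msg-a-3": "SPAM", "ja-4": "JOB ANN", "ca-1": "SPAM"}
truth = {"msg-a-2": "SPAM", "msg-a-3": "SPAM", "ja-4": "JOB ANN", "ca-1": "CONF ANN"}
for line in report(predictions, truth):
    print(line)
```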
--- I am stuck on how to go about opening the folder (directory) that
contains all the files I need to process for this assignment. The
folder contains subfolders, which in turn contain the email files
that need to be processed.
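One way to visit every file below a top-level folder is os.walk from the standard library, sketched below. It assumes each subfolder holds the messages of one category, so the subfolder name can serve as the category label; the function name collect_messages is made up for illustration, and the top-level path has to be supplied by the caller.

```python
import os


def collect_messages(root_dir):
    """Return (category, path) pairs for every file below root_dir.

    The category is taken to be the name of the folder that directly
    contains the file (an assumption about how the corpus is laid out).
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        category = os.path.basename(dirpath)
        for filename in sorted(filenames):
            found.append((category, os.path.join(dirpath, filename)))
    return found
```

Each returned path can then be opened in the usual way, for example with open(path, errors="replace") so that stray bytes in a message do not stop the run.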
Thanks for your time in reading this and hope to hear from you soon..
If you need more info there is a link
From: Danny Yoo [mailto:dyoo at hkn.eecs.berkeley.edu]
Sent: Friday, 11 March 2005 6:04 AM
To: oscar ng
Cc: tutor at python.org
Subject: Re: [Tutor] Newbie in Python
On Thu, 10 Mar 2005, oscar ng wrote:
> Needing help on a mail filtering system that explores the headers and
> text and determines which category the email falls into.
Ok. What help do you need? You have not told us what problems you're
having, so we're stuck just twiddling our thumbs. *grin*
Are you already aware of projects that do this, or are you doing this
for fun? The SpamBayes project has quite a bit of source code that may