Mailman 3 August 2002 - Python-Dev

PEP 1, PEP Purpose and Guidelines
by barry＠zope.com 18 May '21

18 May '21

It has been a while since I posted a copy of PEP 1 to the mailing lists and newsgroups. I've recently done some updating of a few sections, so in the interest of gaining wider community participation in the Python development process, I'm posting the latest revision of PEP 1 here. A version of the PEP is always available on-line at http://www.python.org/peps/pep-0001.html Enjoy, -Barry -------------------- snip snip -------------------- PEP: 1 Title: PEP Purpose and Guidelines Version: $Revision: 1.36 $ Last-Modified: $Date: 2002/07/29 18:34:59 $ Author: Barry A. Warsaw, Jeremy Hylton Status: Active Type: Informational Created: 13-Jun-2000 Post-History: 21-Mar-2001, 29-Jul-2002 What is a PEP? PEP stands for Python Enhancement Proposal. A PEP is a design document providing information to the Python community, or describing a new feature for Python. The PEP should provide a concise technical specification of the feature and a rationale for the feature. We intend PEPs to be the primary mechanisms for proposing new features, for collecting community input on an issue, and for documenting the design decisions that have gone into Python. The PEP author is responsible for building consensus within the community and documenting dissenting opinions. Because the PEPs are maintained as plain text files under CVS control, their revision history is the historical record of the feature proposal[1]. Kinds of PEPs There are two kinds of PEPs. A standards track PEP describes a new feature or implementation for Python. An informational PEP describes a Python design issue, or provides general guidelines or information to the Python community, but does not propose a new feature. Informational PEPs do not necessarily represent a Python community consensus or recommendation, so users and implementors are free to ignore informational PEPs or follow their advice. PEP Work Flow The PEP editor, Barry Warsaw <peps(a)python.org>, assigns numbers for each PEP and changes its status. The PEP process begins with a new idea for Python. It is highly recommended that a single PEP contain a single key proposal or new idea. The more focussed the PEP, the more successfully it tends to be. The PEP editor reserves the right to reject PEP proposals if they appear too unfocussed or too broad. If in doubt, split your PEP into several well-focussed ones. Each PEP must have a champion -- someone who writes the PEP using the style and format described below, shepherds the discussions in the appropriate forums, and attempts to build community consensus around the idea. The PEP champion (a.k.a. Author) should first attempt to ascertain whether the idea is PEP-able. Small enhancements or patches often don't need a PEP and can be injected into the Python development work flow with a patch submission to the SourceForge patch manager[2] or feature request tracker[3]. The PEP champion then emails the PEP editor <peps(a)python.org> with a proposed title and a rough, but fleshed out, draft of the PEP. This draft must be written in PEP style as described below. If the PEP editor approves, he will assign the PEP a number, label it as standards track or informational, give it status 'draft', and create and check-in the initial draft of the PEP. The PEP editor will not unreasonably deny a PEP. Reasons for denying PEP status include duplication of effort, being technically unsound, not providing proper motivation or addressing backwards compatibility, or not in keeping with the Python philosophy. The BDFL (Benevolent Dictator for Life, Guido van Rossum) can be consulted during the approval phase, and is the final arbitrator of the draft's PEP-ability. If a pre-PEP is rejected, the author may elect to take the pre-PEP to the comp.lang.python newsgroup (a.k.a. python-list(a)python.org mailing list) to help flesh it out, gain feedback and consensus from the community at large, and improve the PEP for re-submission. The author of the PEP is then responsible for posting the PEP to the community forums, and marshaling community support for it. As updates are necessary, the PEP author can check in new versions if they have CVS commit permissions, or can email new PEP versions to the PEP editor for committing. Standards track PEPs consists of two parts, a design document and a reference implementation. The PEP should be reviewed and accepted before a reference implementation is begun, unless a reference implementation will aid people in studying the PEP. Standards Track PEPs must include an implementation - in the form of code, patch, or URL to same - before it can be considered Final. PEP authors are responsible for collecting community feedback on a PEP before submitting it for review. A PEP that has not been discussed on python-list(a)python.org and/or python-dev(a)python.org will not be accepted. However, wherever possible, long open-ended discussions on public mailing lists should be avoided. Strategies to keep the discussions efficient include, setting up a separate SIG mailing list for the topic, having the PEP author accept private comments in the early design phases, etc. PEP authors should use their discretion here. Once the authors have completed a PEP, they must inform the PEP editor that it is ready for review. PEPs are reviewed by the BDFL and his chosen consultants, who may accept or reject a PEP or send it back to the author(s) for revision. Once a PEP has been accepted, the reference implementation must be completed. When the reference implementation is complete and accepted by the BDFL, the status will be changed to `Final.' A PEP can also be assigned status `Deferred.' The PEP author or editor can assign the PEP this status when no progress is being made on the PEP. Once a PEP is deferred, the PEP editor can re-assign it to draft status. A PEP can also be `Rejected'. Perhaps after all is said and done it was not a good idea. It is still important to have a record of this fact. PEPs can also be replaced by a different PEP, rendering the original obsolete. This is intended for Informational PEPs, where version 2 of an API can replace version 1. PEP work flow is as follows: Draft -> Accepted -> Final -> Replaced ^ +----> Rejected v Deferred Some informational PEPs may also have a status of `Active' if they are never meant to be completed. E.g. PEP 1. What belongs in a successful PEP? Each PEP should have the following parts: 1. Preamble -- RFC822 style headers containing meta-data about the PEP, including the PEP number, a short descriptive title (limited to a maximum of 44 characters), the names, and optionally the contact info for each author, etc. 2. Abstract -- a short (~200 word) description of the technical issue being addressed. 3. Copyright/public domain -- Each PEP must either be explicitly labelled as placed in the public domain (see this PEP as an example) or licensed under the Open Publication License[4]. 4. Specification -- The technical specification should describe the syntax and semantics of any new language feature. The specification should be detailed enough to allow competing, interoperable implementations for any of the current Python platforms (CPython, JPython, Python .NET). 5. Motivation -- The motivation is critical for PEPs that want to change the Python language. It should clearly explain why the existing language specification is inadequate to address the problem that the PEP solves. PEP submissions without sufficient motivation may be rejected outright. 6. Rationale -- The rationale fleshes out the specification by describing what motivated the design and why particular design decisions were made. It should describe alternate designs that were considered and related work, e.g. how the feature is supported in other languages. The rationale should provide evidence of consensus within the community and discuss important objections or concerns raised during discussion. 7. Backwards Compatibility -- All PEPs that introduce backwards incompatibilities must include a section describing these incompatibilities and their severity. The PEP must explain how the author proposes to deal with these incompatibilities. PEP submissions without a sufficient backwards compatibility treatise may be rejected outright. 8. Reference Implementation -- The reference implementation must be completed before any PEP is given status 'Final,' but it need not be completed before the PEP is accepted. It is better to finish the specification and rationale first and reach consensus on it before writing code. The final implementation must include test code and documentation appropriate for either the Python language reference or the standard library reference. PEP Template PEPs are written in plain ASCII text, and should adhere to a rigid style. There is a Python script that parses this style and converts the plain text PEP to HTML for viewing on the web[5]. PEP 9 contains a boilerplate[7] template you can use to get started writing your PEP. Each PEP must begin with an RFC822 style header preamble. The headers must appear in the following order. Headers marked with `*' are optional and are described below. All other headers are required. PEP: <pep number> Title: <pep title> Version: <cvs version string> Last-Modified: <cvs date string> Author: <list of authors' real names and optionally, email addrs> * Discussions-To: <email address> Status: <Draft | Active | Accepted | Deferred | Final | Replaced> Type: <Informational | Standards Track> * Requires: <pep numbers> Created: <date created on, in dd-mmm-yyyy format> * Python-Version: <version number> Post-History: <dates of postings to python-list and python-dev> * Replaces: <pep number> * Replaced-By: <pep number> The Author: header lists the names and optionally, the email addresses of all the authors/owners of the PEP. The format of the author entry should be address(a)dom.ain (Random J. User) if the email address is included, and just Random J. User if the address is not given. If there are multiple authors, each should be on a separate line following RFC 822 continuation line conventions. Note that personal email addresses in PEPs will be obscured as a defense against spam harvesters. Standards track PEPs must have a Python-Version: header which indicates the version of Python that the feature will be released with. Informational PEPs do not need a Python-Version: header. While a PEP is in private discussions (usually during the initial Draft phase), a Discussions-To: header will indicate the mailing list or URL where the PEP is being discussed. No Discussions-To: header is necessary if the PEP is being discussed privately with the author, or on the python-list or python-dev email mailing lists. Note that email addresses in the Discussions-To: header will not be obscured. Created: records the date that the PEP was assigned a number, while Post-History: is used to record the dates of when new versions of the PEP are posted to python-list and/or python-dev. Both headers should be in dd-mmm-yyyy format, e.g. 14-Aug-2001. PEPs may have a Requires: header, indicating the PEP numbers that this PEP depends on. PEPs may also have a Replaced-By: header indicating that a PEP has been rendered obsolete by a later document; the value is the number of the PEP that replaces the current document. The newer PEP must have a Replaces: header containing the number of the PEP that it rendered obsolete. PEP Formatting Requirements PEP headings must begin in column zero and the initial letter of each word must be capitalized as in book titles. Acronyms should be in all capitals. The body of each section must be indented 4 spaces. Code samples inside body sections should be indented a further 4 spaces, and other indentation can be used as required to make the text readable. You must use two blank lines between the last line of a section's body and the next section heading. You must adhere to the Emacs convention of adding two spaces at the end of every sentence. You should fill your paragraphs to column 70, but under no circumstances should your lines extend past column 79. If your code samples spill over column 79, you should rewrite them. Tab characters must never appear in the document at all. A PEP should include the standard Emacs stanza included by example at the bottom of this PEP. A PEP must contain a Copyright section, and it is strongly recommended to put the PEP in the public domain. When referencing an external web page in the body of a PEP, you should include the title of the page in the text, with a footnote reference to the URL. Do not include the URL in the body text of the PEP. E.g. Refer to the Python Language web site [1] for more details. ... [1] http://www.python.org When referring to another PEP, include the PEP number in the body text, such as "PEP 1". The title may optionally appear. Add a footnote reference that includes the PEP's title and author. It may optionally include the explicit URL on a separate line, but only in the References section. Note that the pep2html.py script will calculate URLs automatically, e.g.: ... Refer to PEP 1 [7] for more information about PEP style ... References [7] PEP 1, PEP Purpose and Guidelines, Warsaw, Hylton http://www.python.org/peps/pep-0001.html If you decide to provide an explicit URL for a PEP, please use this as the URL template: http://www.python.org/peps/pep-xxxx.html PEP numbers in URLs must be padded with zeros from the left, so as to be exactly 4 characters wide, however PEP numbers in text are never padded. Reporting PEP Bugs, or Submitting PEP Updates How you report a bug, or submit a PEP update depends on several factors, such as the maturity of the PEP, the preferences of the PEP author, and the nature of your comments. For the early draft stages of the PEP, it's probably best to send your comments and changes directly to the PEP author. For more mature, or finished PEPs you may want to submit corrections to the SourceForge bug manager[6] or better yet, the SourceForge patch manager[2] so that your changes don't get lost. If the PEP author is a SF developer, assign the bug/patch to him, otherwise assign it to the PEP editor. When in doubt about where to send your changes, please check first with the PEP author and/or PEP editor. PEP authors who are also SF committers, can update the PEPs themselves by using "cvs commit" to commit their changes. Remember to also push the formatted PEP text out to the web by doing the following: % python pep2html.py -i NUM where NUM is the number of the PEP you want to push out. See % python pep2html.py --help for details. Transferring PEP Ownership It occasionally becomes necessary to transfer ownership of PEPs to a new champion. In general, we'd like to retain the original author as a co-author of the transferred PEP, but that's really up to the original author. A good reason to transfer ownership is because the original author no longer has the time or interest in updating it or following through with the PEP process, or has fallen off the face of the 'net (i.e. is unreachable or not responding to email). A bad reason to transfer ownership is because you don't agree with the direction of the PEP. We try to build consensus around a PEP, but if that's not possible, you can always submit a competing PEP. If you are interested assuming ownership of a PEP, send a message asking to take over, addressed to both the original author and the PEP editor <peps(a)python.org>. If the original author doesn't respond to email in a timely manner, the PEP editor will make a unilateral decision (it's not like such decisions can be reversed. :). References and Footnotes [1] This historical record is available by the normal CVS commands for retrieving older revisions. For those without direct access to the CVS tree, you can browse the current and past PEP revisions via the SourceForge web site at http://cvs.sourceforge.net/cgi-bin/cvsweb.cgi/python/nondist/peps/?cvsroot=… [2] http://sourceforge.net/tracker/?group_id=5470&atid=305470 [3] http://sourceforge.net/tracker/?atid=355470&group_id=5470&func=browse [4] http://www.opencontent.org/openpub/ [5] The script referred to here is pep2html.py, which lives in the same directory in the CVS tree as the PEPs themselves. Try "pep2html.py --help" for details. The URL for viewing PEPs on the web is http://www.python.org/peps/ [6] http://sourceforge.net/tracker/?group_id=5470&atid=305470 [7] PEP 9, Sample PEP Template http://www.python.org/peps/pep-0009.html Copyright This document has been placed in the public domain. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 End:

8 14

RE: [Python-Dev] Those import related syntax errors again...
by Samuele Pedroni 23 Feb '05

23 Feb '05

Hi. [Mark Hammond] > The point isn't about my suffering as such. The point is more that > python-dev owns a tiny amount of the code out there, and I don't believe we > should put Python's users through this. > > Sure - I would be happy to "upgrade" all the win32all code, no problem. I > am also happy to live in the bleeding edge and take some pain that will > cause. > > The issue is simply the user base, and giving Python a reputation of not > being able to painlessly upgrade even dot revisions. I agree with all this. [As I imagined explicit syntax did not catch up and would require lot of discussions.] [GvR] > > Another way is to use special rules > > (similar to those for class defs), e.g. having > > > > <frag> > > y=3 > > def f(): > > exec "y=2" > > def g(): > > return y > > return g() > > > > print f() > > </frag> > > > > # print 3. > > > > Is that confusing for users? maybe they will more naturally expect 2 > > as outcome (given nested scopes). > > This seems the best compromise to me. It will lead to the least > broken code, because this is the behavior that we had before nested > scopes! It is also quite easy to implement given the current > implementation, I believe. > > Maybe we could introduce a warning rather than an error for this > situation though, because even if this behavior is clearly documented, > it will still be confusing to some, so it is better if we outlaw it in > some future version. > Yes this can be easy to implement but more confusing situations can arise: <frag> y=3 def f(): y=9 exec "y=2" def g(): return y return y,g() print f() </frag> What should this print? the situation leads not to a canonical solution as class def scopes. or <frag> def f(): from foo import * def g(): return y return g() print f() </frag> [Mark Hammond] > > This probably won't be a very popular suggestion, but how about pulling > > nested scopes (I assume they are at the root of the problem) > > until this can be solved cleanly? > > Agreed. While I think nested scopes are kinda cool, I have lived without > them, and really without missing them, for years. At the moment the cure > appears worse then the symptoms in at least a few cases. If nothing else, > it compromises the elegant simplicity of Python that drew me here in the > first place! > > Assuming that people really _do_ want this feature, IMO the bar should be > raised so there are _zero_ backward compatibility issues. I don't say anything about pulling nested scopes (I don't think my opinion can change things in this respect) but I should insist that without explicit syntax IMO raising the bar has a too high impl cost (both performance and complexity) or creates confusion. [Andrew Kuchling] > >Assuming that people really _do_ want this feature, IMO the bar should be > >raised so there are _zero_ backward compatibility issues. > > Even at the cost of additional implementation complexity? At the cost > of having to learn "scopes are nested, unless you do these two things > in which case they're not"? > > Let's not waffle. If nested scopes are worth doing, they're worth > breaking code. Either leave exec and from..import illegal, or back > out nested scopes, or think of some better solution, but let's not > introduce complicated backward compatibility hacks. IMO breaking code would be ok if we issue warnings today and implement nested scopes issuing errors tomorrow. But this is simply a statement about principles and raised impression. IMO import * in an inner scope should end up being an error, not sure about 'exec's. We will need a final BDFL statement. regards, Samuele Pedroni.

18 78

PEP: Support for System Upgrades
by M.-A. Lemburg 06 Jan '03

06 Jan '03

PEP: 0??? Title: Support for System Upgrades Version: $Revision: 0.0 $ Author: mal(a)lemburg.com (Marc-Andr? Lemburg) Status: Draft Type: Standards Track Python-Version: 2.3 Created: 19-Jul-2001 Post-History: Abstract This PEP proposes strategies to allow the Python standard library to be upgraded in parts without having to reinstall the complete distribution or having to wait for a new patch level release. Problem Python currently does not allow overriding modules or packages in the standard library per default. Even though this is possible by defining a PYTHONPATH environment variable (the paths defined in this variable are prepended to the Python standard library path), there is no standard way of achieving this without changing the configuration. Since Python's standard library is starting to host packages which are also available separately, e.g. the distutils, email and PyXML packages, which can also be installed independently of the Python distribution, it is desireable to have an option to upgrade these packages without having to wait for a new patch level release of the Python interpreter to bring along the changes. Proposed Solutions This PEP proposes two different but not necessarily conflicting solutions: 1. Adding a new standard search path to sys.path: $stdlibpath/system-packages just before the $stdlibpath entry. This complements the already existing entry for site add-ons $stdlibpath/site-packages which is appended to the sys.path at interpreter startup time. To make use of this new standard location, distutils will need to grow support for installing certain packages in $stdlibpath/system-packages rather than the standard location for third-party packages $stdlibpath/site-packages. 2. Tweaking distutils to install directly into $stdlibpath for the system upgrades rather than into $stdlibpath/site-packages. The first solution has a few advantages over the second: * upgrades can be easily identified (just look in $stdlibpath/system-packages) * upgrades can be deinstalled without affecting the rest of the interpreter installation * modules can be virtually removed from packages; this is due to the way Python imports packages: once it finds the top-level package directory it stay in this directory for all subsequent package submodule imports * the approach has an overall much cleaner design than the hackish install on top of an existing installation approach The only advantages of the second approach are that the Python interpreter does not have to changed and that it works with older Python versions. Both solutions require changes to distutils. These changes can also be implemented by package authors, but it would be better to define a standard way of switching on the proposed behaviour. Scope Solution 1: Python 2.3 and up Solution 2: all Python versions supported by distutils Credits None References None Copyright This document has been placed in the public domain. Local Variables: mode: indented-text indent-tabs-mode: nil End: -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

4 13

PEP 282 Implementation
by Vinay Sajip 07 Oct '02

07 Oct '02

I've uploaded my logging module, the proposed implementation for PEP 282, for committer review, to the SourceForge patch manager: http://sourceforge.net/tracker/index.php?func=detail&aid=578494&group_id=547 0&atid=305470 I've assigned it to Mark Hammond as (a) he had posted some comments to Trent Mick's original PEP posting, and (b) Barry Warsaw advised not assigning to PythonLabs people on account of their current workload. The file logging.py is (apart from some test scripts) all that's supposed to go into Python 2.3. The file logging-0.4.6.tar.gz contains the module, an updated version of the PEP (which I mailed to Barry Warsaw on 26th June), numerous test/example scripts, TeX documentation etc. You can also refer to http://www.red-dove.com/python_logging.html Here's hoping for a speedy review :-) Regards, Vinay Sajip

5 12

Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8
by Skip Montanaro 16 Sep '02

16 Sep '02

tim> Straight character n-grams are very appealing because they're the tim> simplest and most language-neutral; I didn't have any luck with tim> them over the weekend, but the size of my training data was tim> trivial. Anybody up for pooling corpi (corpora?)? Skip

21 78

type categories
by Andrew Koenig 13 Sep '02

13 Sep '02

While I was driving to work today, I had a thought about the iterator/iterable discussion of a few weeks ago. My impression is that that discussion was inconclusive, but a few general principles emerged from it: 1) Some types are iterators -- that is, they support calls to next() and raise StopIteration when they have no more information to give. 2) Some types are iterables -- that is, they support calls to __iter__() that yield an iterator as the result. 3) Every iterator is also an iterable, because iterators are required to implement __iter__() as well as next(). 4) The way to determine whether an object is an iterator is to call its next() method and see what happens. 5) The way to determine whether an object is an iterable is to call its __iter__() method and see what happens. I'm uneasy about (4) because if an object is an iterator, calling its next() method is destructive. The implication is that you had better not use this method to test if an object is an iterator until you are ready to take irrevocable action based on that test. On the other hand, calling __iter__() is safe, which means that you can test nondestructively whether an object is an iterable, which includes all iterators. Here is what I realized this morning. It may be obvious to you, but it wasn't to me (until after I realized it, of course): ``iterator'' and ``iterable'' are just two of many type categories that exist in Python. Some other categories: callable sequence generator class instance type number integer floating-point number complex number mutable tuple mapping method built-in As far as I know, there is no uniform method of determining into which category or categories a particular object falls. Of course, there are non-uniform ways of doing so, but in general, those ways are, um, nonuniform. Therefore, if you want to check whether an object is in one of these categories, you haven't necessarily learned much about how to check if it is in a different one of these categories. So what I wonder is this: Has there been much thought about making these type categories more explicitly part of the type system?

24 131

Re: RE: [Python-Dev] The first trustworthy <wink> GBayes results
by Paul Graham 10 Sep '02

10 Sep '02

Don't count words multiple times, and you'll probably get fewer false positives. That's the main reason I don't do it-- because it magnifies the effect of some random word like water happening to have a big spam probability. (Incidentally, why so high? In my db it's only 0.3930784.) --pg --Tim Peters wrote: > FYI. After cleaning the blatant spam identified by the classifier out of my > ham corpus, and replacing it with new random msgs from Barry's corpus, the > reported false positive rate fell to about 0.2% (averaging 8 per each batch > of 4000 ham test messages). This seems remarkable given that it's ignoring > headers, and just splitting the raw text on whitespace in total ignorance of > HTML & MIME etc. > > 'FREE' (all caps) moved into the ranks of best spam indicators. The false > negative rate got reduced by a small amount, but I doubt it's a > statistically significant reduction (I'll compute that stuff later; I'm > looking for Big Things now). > > Some of these false positives are almost certainly spam, and at least one is > almost certainly a virus: these are msgs that are 100% base64-encoded, or > maximally obfuscated quoted-printable. That could almost certainly be fixed > by, e.g., decoding encoded text. > > The other false positives seem harder to deal with: > > + Brief HMTL msgs from newbies. I doubt the headers will help these > get through, as they're generally first-time posters, and aren't > replies to earlier msgs. There's little positive content, while > all elements of raw HTML have high "it's spam" probability. > > Example: > > """ > --------------=_4D4800B7C99C4331D7B8 > Content-Description: filename="text1.txt" > Content-Type: text/plain; charset=ISO-8859-1 > Content-Transfer-Encoding: quoted-printable > > Is there a version of Python with Prolog Extension?? > Where can I find it if there is? > > Thanks, > Luis. > > P.S. Could you please reply to the sender too. > > > --------------=_4D4800B7C99C4331D7B8 > Content-Description: filename="text1.html" > Content-Type: text/html > Content-Transfer-Encoding: quoted-printable > > <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> > <HTML> > <HEAD> > <TITLE>Prolog Extension</TITLE> > <META NAME=3D"GENERATOR" CONTENT=3D"StarOffice/5.1 (Linux)"> > <META NAME=3D"CREATED" CONTENT=3D"19991127;12040200"> > <META NAME=3D"CHANGEDBY" CONTENT=3D"Luis Cortes"> > <META NAME=3D"CHANGED" CONTENT=3D"19991127;12044700"> > </HEAD> > <BODY> > <PRE>Is there a version of Python with Prolog Extension?? > Where can I find it if there is? > > Thanks, > Luis. > > P.S. Could you please reply to the sender too.</PRE> > </BODY> > </HTML> > > --------------=_4D4800B7C99C4331D7B8--""" > """ > > Here's how it got scored: > > prob = 0.999958816093 > prob('<META') = 0.957529 > prob('<META') = 0.957529 > prob('<META') = 0.957529 > prob('<BODY>') = 0.979284 > prob('Prolog') = 0.01 > prob('<HEAD>') = 0.97989 > prob('Thanks,') = 0.0337316 > prob('Prolog') = 0.01 > prob('Python') = 0.01 > prob('NAME=3D"GENERATOR"') = 0.99 > prob('<HTML>') = 0.99 > prob('</HTML>') = 0.989494 > prob('</BODY>') = 0.987429 > prob('Thanks,') = 0.0337316 > prob('Python') = 0.01 > > Note that '<META' gets penalized 3 times. More on that later. > > + Msgs talking *about* HTML, and including HTML in examples. This one > may be troublesome, but there are mercifully few of them. > > + Brief msgs with obnoxious employer-generated signatures. Example: > > """ > Hi there, > > I am looking for you recommendations on training courses available in the UK > on Python. Can you help? > > Thanks, > > Vickie Mills > IS Training Analyst > > Tel: 0131 245 1127 > Fax: 0131 245 1550 > E-mail: vickie_mills(a)standardlife.com > > For more information on Standard Life, visit our website > http://www.standardlife.com/ The Standard Life Assurance Company, Standard > Life House, 30 Lothian Road, Edinburgh EH1 2DH, is registered in Scotland > (No SZ4) and regulated by the Personal Investment Authority. Tel: 0131 225 > 2552 - calls may be recorded or monitored. This confidential e-mail is for > the addressee only. If received in error, do not retain/copy/disclose it > without our consent and please return it to us. We virus scan all e-mails > but are not responsible for any damage caused by a virus or alteration by a > third party after it is sent. > """ > > The scoring: > > prob = 0.98654879055 > prob('our') = 0.928936 > prob('sent.') = 0.939891 > prob('Tel:') = 0.0620155 > prob('Thanks,') = 0.0337316 > prob('received') = 0.940256 > prob('Tel:') = 0.0620155 > prob('Hi') = 0.0533333 > prob('help?') = 0.01 > prob('Personal') = 0.970976 > prob('regulated') = 0.99 > prob('Road,') = 0.01 > prob('Training') = 0.99 > prob('e-mails') = 0.987542 > prob('Python.') = 0.01 > prob('Investment') = 0.99 > > The brief human-written part is fine, but the longer boilerplate sig is > indistinguishable from spam. > > + The occassional non-Python conference announcement(!). These are > long, so I'll skip an example. In effect, it's automated bulk email > trying to sell you a conference, so is prone to use the language and > artifacts of advertising. Here's typical scoring, for the TOOLS > Europe '99 conference announcement: > > prob = 0.983583974285 > prob('THE') = 0.983584 > prob('Object') = 0.01 > prob('Bell') = 0.01 > prob('Object-Oriented') = 0.01 > prob('**************************************************************') = > 0.99 > prob('Bertrand') = 0.01 > prob('Rational') = 0.01 > prob('object-oriented') = 0.01 > prob('CONTACT') = 0.99 > prob('**************************************************************') = > 0.99 > prob('innovative') = 0.99 > prob('**************************************************************') = > 0.99 > prob('Olivier') = 0.01 > prob('VISIT') = 0.99 > prob('OUR') = 0.99 > > Note the repeated penalty for the lines of asterisks. That segues into the > next one: > > + Artifacts of that the algorithm counts multiples instances of "a word" > multiple times. These are baffling at first sight! The two clearest > examples: > > """ > > > Can you create and use new files with dbhash.open()? > > > > Yes. But if I run db_dump on these files, it says "unexpected file type > > or format", regardless which db_dump version I use (2.0.77, 3.0.55, > > 3.1.17) > > > > It may be that db_dump isn't compatible with version 1.85 databse files. I > can't remember. I seem to recall that there was an option to build 1.85 > versions of db_dump and db_load. Check the configure options for > BerkeleyDB to find out. (Also, while you are there, make sure that > BerkeleyDB was built the same on both of your platforms...) > > > > > > > Try running db_verify (one of the utilities built > > > when you compiled DB) on the file and see what it tells you. > > > > There is no db_verify among my Berkeley DB utilities. > > There should have been a bunch of them built when you compiled DB. I've got > these: >

5 6

Re: [Python-Dev] utf8 issue
by Michael Hudson 06 Sep '02

06 Sep '02

Guido van Rossum <guido(a)python.org> writes: > This might beling on SF, except it's already been solved in Python > 2.3, and I need guidance about what to do for Python 2.2.2. > > In 2.2.1, a lone surrogate encoded into utf8 gives an utf8 string that > cannot be decode back. In 2.3, this is fixed. Should this be fixed > in 2.2.2 as well? I think this was discussed really quite a long time ago, like six months or so. > I'm asking because it caused problems with reading .pyc files: if > there's a Unicode literal containing a lone surrogate, reading the > .pyc file causes an exception: > > UnicodeError: UTF-8 decoding error: unexpected code byte > > It looks like revision 2.128 fixed this for 2.3, but that patch > doesn't cleanly apply to the 2.2 maintenance branch. Can someone > help? I think the reason this didn't get fixed in 2.2.1 is that it necessitates bumping MAGIC. I can probably dig up more references if you want. Cheers, M. -- 34. The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information. -- Alan Perlis, http://www.cs.yale.edu/homes/perlis-alan/quotes.html

4 5

Proposed Mixins for Wide Interfaces
by Raymond Hettinger 05 Sep '02

05 Sep '02

How about adding some mixins to simplify the implementation of some of the fatter interfaces? class CompareMixin: """ Given an __eq__ method in a subclass, adds a __ne__ method Given __eq__ and __lt__, adds !=, <=, >, >=. """ class MappingMixin: """ Given __setitem__, __getitem__, and keys, implements values, items, update, get, setdefault, len, iterkeys, iteritems, itervalues, has_key, and __contains__. If __delitem__ is also supplied, implements clear, pop, and popitem. Takes advantage of __iter__ if supplied (recommended). Takes advantage of __contains__ or has_key if supplied (recommended). """ The idea is to make it easier to implement these interfaces. Also, if the interfaces get expanded, the clients automatically updated. Raymond Hettinger

4 8

The first trustworthy <wink> GBayes results
by Tim Peters 04 Sep '02

04 Sep '02

Setting this up has been a bitch. All early attempts floundered because it turned out there was *some* systematic difference between the ham and spam archives that made the job trivial. The ham archive: I selected 20,000 messages, and broke them into 5 sets of 4,000 each, at random, from a python-list archive Barry put together, containing msgs only after SpamAssassin was put into play on python.org. It's hoped that's pretty clean, but nobody checked all ~= 160,000+ msgs. As will be seen below, it's not clean enough. The spam archive: This is essentially all of Bruce Guenter's 2002 spam collection, at <http://www.em.ca/~bruceg/spam/>. It was broken at random into 5 sets of 2,750 spams each. Problems included: + Mailman added distinctive headers to every message in the ham archive, which appear nowhere in the spam archive. A Bayesian classifier picks up on that immediately. + Mailman also adds "[name-of-list]" to every Subject line. + The spam headers had tons of clues about Bruce Guenter's mailing addresses that appear nowhere in the ham headers. + The spam archive has Windows line ends (\r\n), but the ham archive plain Unix \n. This turned out to be a killer clue(!) in the simplest character n-gram attempts. (Note: I can't use text mode to read msgs, because there are binary characters in the archives that Windows treats as EOF in text mode -- indeed, 400MB of the ham archive vanishes when read in text mode!) What I'm reporting on here is after normalizing all line-ends to \n, and ignoring the headers *completely*. There are obviously good clues in the headers, the problem is that they're killer-good clues for accidental reasons in this test data. I don't want to write code to suppress these clues either, as then I'd be testing some mix of my insights (or lack thereof) with what a blind classifier would do. But I don't care how good I am, I only care about how well the algorithm does. Since it's ignoring the headers, I think it's safe to view this as a lower bound on what can be achieved. There's another way this should be a lower bound: def tokenize_split(string): for w in string.split(): yield w tokenize = tokenize_split class Msg(object): def __init__(self, dir, name): path = dir + "/" + name self.path = path f = file(path, 'rb') guts = f.read() f.close() # Skip the headers. i = guts.find('\n\n') if i >= 0: guts = guts[i+2:] self.guts = guts def __iter__(self): return tokenize(self.guts) This is about the stupidest tokenizer imaginable, merely splitting the body on whitespace. Here's the output from the first run, training against one pair of spam+ham groups, then seeing how its predictions stack up against each of the four other pairs of spam+ham groups: Training on Data/Ham/Set1 and Data/Spam/Set1 ... 4000 hams and 2750 spams testing against Data/Spam/Set2 and Data/Ham/Set2 tested 4000 hams and 2750 spams false positive: 0.00725 (i.e., under 1%) false negative: 0.0530909090909 (i.e., over 5%) testing against Data/Spam/Set3 and Data/Ham/Set3 tested 4000 hams and 2750 spams false positive: 0.007 false negative: 0.056 testing against Data/Spam/Set4 and Data/Ham/Set4 tested 4000 hams and 2750 spams false positive: 0.0065 false negative: 0.0545454545455 testing against Data/Spam/Set5 and Data/Ham/Set5 tested 4000 hams and 2750 spams false positive: 0.00675 false negative: 0.0516363636364 It's a Good Sign that the false positive/negative rates are very close across the four test runs. It's possible to quantify just how good a sign that is, but they're so close by eyeball that there's no point in bothering. This is using the new Tester.py in the sandbox, and that class automatically remembers the false positives and negatives. Here's the start of the first false positive from the first run: """ It's not really hard!! Turn $6.00 into $1,000 or more...read this to find out how!! READING THIS COULD CHANGE YOUR LIFE!! I found this on a bulletin board anddecided to try it. A little while back, while chatting on the internet, I came across an article similar to this that said you could make thousands of dollars in cash within weeks with only an initial investment of $6.00! So I thought, "Yeah right, this must be a scam", but like most of us, I was curious, so I kept reading. Anyway, it said that you send $1.00 to each of the six names and address statedin the article. You then place your own name and address in the bottom of the list at #6, and post the article in at least 200 newsgroups (There are thousands) or e-mail them. No """ Call me forgiving, but I think it's vaguely possible that this should have been in the spam corpus instead <wink>. Here's the start of the second false positive: """ Please forward this message to anyone you know who is active in the stock market. See Below for Press Release xXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX Dear Friends, I am a normal investor same as you. I am not a finance professional nor am I connected to FDNI in any way. I recently stumbled onto this OTC stock (FDNI) while searching through yahoo for small float, big potential stocks. At the time, the company had released a press release which stated they were doing a stock buyback. Intrigued, I bought 5,000 shares at $.75 each. The stock went to $1.50 and I sold my shares. I then bought them back at $1.15. The company then circulated another press release about a foreign acquisition (see below). The stock jumped to $2.75 (I sold @ $2.50 for a massive profit). I then bought back in at $1.25 where I am holding until the next major piece of news. """ Here's the start of the third: """ Grand Treasure Industrial Limited Contact Information We are a manufacturer and exporter in Hong Kong for all kinds of plastic products, We export to worldwide markets. Recently , we join-ventured with a bag factory in China produce all kinds of shopping , lady's , traveller's bags.... visit our page and send us your enquiry by email now. Contact Address : Rm. 1905, Asian Trade Centre , 79 Lei Muk Rd, Tsuen Wan , Hong Kong. Telephone : ( 852 ) 2408 9382 """ That is, all the "false positives" there are blatant spam. It will take a long time to sort this all out, but I want to make a point here now: the classifier works so well that it can *help* clean the ham corpus! I haven't found a non-spam among the "false positives" yet. Another lesson reinforces one from my previous life in speech recognition: rigorous data collection, cleaning, tagging and maintenance is crucial when working with statisical approaches, and is damned expensive to do. Here's the start of the first "false negative" (including the headers): """ Return-Path: <911(a)911.COM> Delivered-To: em-ca-bruceg(a)em.ca Received: (qmail 24322 invoked from network); 28 Jul 2002 12:51:44 -0000 Received: from unknown (HELO PC-5.) (61.48.16.65) by churchill.factcomp.com with SMTP; 28 Jul 2002 12:51:44 -0000 x-esmtp: 0 0 1 Message-ID: <1604543-22002702894513952(a)smtp.vip.sina.com> To: "NEW020515" <911(a)911.COM> From: "ÖÐ¹úITÊý¾Ý¿âÍøÕ¾£¨www.itdatabase.net £©" <911(a)911.COM> Subject: ÖÐ¹úITÊý¾Ý¿âÍøÕ¾£¨www.itdatabase.net £© Date: Sun, 28 Jul 2002 17:45:13 +0800 MIME-Version: 1.0 Content-type: text/plain; charset=gb2312 Content-Transfer-Encoding: quoted-printable Content-Length: 977 =D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www=2Eitdatabase=2Enet =A3= =A9=CC=E1=B9=A9=B4=F3=C1=BF=D3=D0=B9=D8=D6=D0=B9=FAIT/=CD=A8=D0=C5=CA=D0=B3= =A1=D2=D4=BC=B0=C8=AB=C7=F2IT/=CD=A8=D0=C5=CA=D0=B3=A1=B5=C4=CF=E0=B9=D8=CA= =FD=BE=DD=BA=CD=B7=D6=CE=F6=A1=A3 =B1=BE=CD=F8=D5=BE=C9=E6=BC=B0=D3=D0=B9=D8= =B5=E7=D0=C5=D4=CB=D3=AA=CA=D0=B3=A1=A1=A2=B5=E7=D0=C5=D4=CB=D3=AA=C9=CC=A1= """ Since I'm ignoring the headers, and the tokenizer is just a whitespace split, each line of quoted-printable looks like a single word to the classifier. Since it's never seen these "words" before, it has no reason to believe they're either spam or ham indicators, and favors calling it ham. One more mondo cool thing and that's it for now. The GrahamBayes class keeps track of how many times each word makes it into the list of the 15 strongest indicators. These are the "killer clues" the classifier gets the most value from. The most valuable spam indicator turned out to be " " -- there's simply almost no HTML mail in the ham archive (but note that this clue would be missed if you stripped HTML!). You're never going to guess what the most valuable non-spam indicator was, but it's quite plausible after you see it. Go ahead, guess. Chicken <wink>. Here are the 15 most-used killer clues across the runs shown above: the repr of the word, followed by the # of times it made into the 15-best list, and the estimated probability that a msg is spam if it contains this word: testing against Data/Spam/Set2 and Data/Ham/Set2 best discrimators: 'Helvetica,' 243 0.99 'object' 245 0.01 'language' 258 0.01 ' ' 292 0.99 '>' 339 0.179104 'def' 397 0.01 'article' 423 0.01 'module' 436 0.01 'import' 499 0.01 ' ' 652 0.99 '>>>' 667 0.01 'wrote' 677 0.01 'python' 755 0.01 'Python' 1947 0.01 'wrote:' 1988 0.01 testing against Data/Spam/Set3 and Data/Ham/Set3 best discrimators: 'string' 494 0.01 'Helvetica,' 496 0.99 'language' 524 0.01 ' ' 553 0.99 '>' 687 0.179104 'article' 851 0.01 'module' 857 0.01 'def' 875 0.01 'import' 1019 0.01 ' ' 1288 0.99 '>>>' 1344 0.01 'wrote' 1355 0.01 'python' 1461 0.01 'Python' 3858 0.01 'wrote:' 3984 0.01 testing against Data/Spam/Set4 and Data/Ham/Set4 best discrimators: 'object' 749 0.01 'Helvetica,' 757 0.99 'language' 763 0.01 ' ' 877 0.99 '>' 954 0.179104 'article' 1240 0.01 'module' 1260 0.01 'def' 1364 0.01 'import' 1517 0.01 ' ' 1765 0.99 '>>>' 1999 0.01 'wrote' 2071 0.01 'python' 2160 0.01 'Python' 5848 0.01 'wrote:' 6021 0.01 testing against Data/Spam/Set5 and Data/Ham/Set5 best discrimators: 'object' 980 0.01 'language' 992 0.01 'Helvetica,' 1005 0.99 ' ' 1139 0.99 '>' 1257 0.179104 'article' 1678 0.01 'module' 1702 0.01 'def' 1846 0.01 'import' 2003 0.01 ' ' 2387 0.99 '>>>' 2624 0.01 'wrote' 2743 0.01 'python' 2864 0.01 'Python' 7830 0.01 'wrote:' 8060 0.01 Note that an "intelligent" tokenizer would likely miss that the Python prompt ('>>>') is a great non-spam indicator on python-list. I've had this argument with some of you before <wink>, but the best way to let this kind of thing be as intelligent as it can be is not to try to help it too much: it will learn things you'll never dream of, provided only you don't filter clues out in an attempt to be clever. everything's-a-clue-ly y'rs - tim

10 22