[Tutor] extract meaningful data from garbage

Alan Gauld alan.gauld at btinternet.com
Sun Jan 3 15:54:35 CET 2010

"Shashwat Anand" <anand.shashwat at gmail.com> wrote

> as to match all of them. The task is time-consuming but with every new
> test-sets exceptions are becoming less and less. (There are .2 million such
> pages)

One final thing to try is to identify the records where you *failed* to
find a match and write them out to an error file. The error file can then
be processed manually if need be.
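
As a rough illustration of that idea, here is a minimal sketch. The
"name: value" pattern, the function names and the errors.txt file name are
all made up for the example; substitute whatever patterns you are already
using on your pages.

```python
import re

# Hypothetical record format ("name: value"); replace with your real patterns.
RECORD = re.compile(r"^(?P<name>\w+):\s*(?P<value>\d+)$")

def split_records(lines, pattern=RECORD):
    """Return (matches, failures) for an iterable of text lines."""
    matches, failures = [], []
    for line in lines:
        text = line.strip()
        m = pattern.match(text)
        if m:
            matches.append(m.groupdict())
        else:
            # Keep the unmatched text so it can be reviewed by hand later.
            failures.append(text)
    return matches, failures

def write_error_file(failures, path="errors.txt"):
    """Write one unmatched record per line to the error file."""
    with open(path, "w", encoding="utf-8") as f:
        for text in failures:
            f.write(text + "\n")
```

You would call split_records() on each page's lines and pass the collected
failures to write_error_file() at the end of the run.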

You might also be able to clean up the error file by not writing lines
that you know to be non-useful. The resultant error file might then
show up some further patterns that you can exploit.
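
One possible way to do that clean-up, again only a sketch: the noise
patterns below (blank lines, HTML comments, separator rules) are guesses at
what "non-useful" might mean for your data, and the shape() helper is my own
invention for grouping similar-looking failures so that new patterns stand
out.

```python
import re
from collections import Counter

# Hypothetical examples of lines known to carry no data; adjust to your pages.
NOISE = [
    re.compile(r"^\s*$"),        # blank lines
    re.compile(r"^<!--.*-->$"),  # HTML comments
    re.compile(r"^[-=*_]{3,}$"), # separator rules
]

def is_noise(text):
    """True if the line matches any known non-useful pattern."""
    return any(p.match(text) for p in NOISE)

def shape(text):
    """Collapse digit runs to '9' and letter runs to 'a' so that
    similar-looking lines end up with the same shape."""
    text = re.sub(r"\d+", "9", text)
    return re.sub(r"[A-Za-z]+", "a", text)

def summarise_failures(failures):
    """Drop known noise, then count how often each line shape occurs."""
    kept = [t for t in failures if not is_noise(t)]
    return kept, Counter(shape(t) for t in kept)
```

The most common shapes in the resulting Counter are the first candidates
for a new pattern in the main extraction code.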

It's all about eliminating as much manual effort as possible and
making the manual work that is left over as easy as possible.
i.e. accept that you won't ever get 100% success and aim to
minimise the pain as much as possible.


Alan Gauld
Author of the Learn to Program web site
