[Tutor] extract meaningful data from garbage

Shashwat Anand anand.shashwat at gmail.com
Sun Jan 3 16:33:33 CET 2010


I am doing almost the same thing, i.e. giving the values left unparsed a
certain name, 'NIL', and currently I'm redirecting the output to a text file.
Searching for 'NIL' tells me where my match failed, although writing those
records separately to a different file didn't occur to me. And yes, the job is
to reduce manual work as much as possible; I get it now. Thanks for the help :)
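
A minimal sketch of combining the two ideas (the 'NIL' placeholder in the main
output plus a separate error file) might look like the following. The RECORD
pattern and the function names parse_lines and write_results are hypothetical
stand-ins; the actual patterns and page format are not shown in this thread.

```python
import re

# Hypothetical pattern for "name: value" records. The real patterns used on
# the pages are not shown in the thread, so this is only a stand-in.
RECORD = re.compile(r"^(?P<name>\w+):\s*(?P<value>\d+)$")


def parse_lines(lines, pattern=RECORD):
    """Return (parsed, failed): extracted fields and the raw unmatched lines."""
    parsed, failed = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank lines are known to be non-useful, so skip them
        match = pattern.match(text)
        if match:
            parsed.append((match.group("name"), match.group("value")))
        else:
            parsed.append(("NIL", "NIL"))  # placeholder stays in the main output
            failed.append(text)            # raw line is kept for manual review
    return parsed, failed


def write_results(lines, out_file, err_file):
    """Write parsed records to out_file and unmatched lines to err_file."""
    parsed, failed = parse_lines(lines)
    for name, value in parsed:
        out_file.write(f"{name}\t{value}\n")
    for text in failed:
        err_file.write(text + "\n")
    return len(failed)  # number of records left for manual processing
```

Called with two open file objects, e.g. write_results(open("pages.txt"),
open("out.txt", "w"), open("errors.txt", "w")) (file names assumed), the error
file then holds only the lines that still need a new pattern or manual work.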

~Shashwat

On Sun, Jan 3, 2010 at 8:24 PM, Alan Gauld <alan.gauld at btinternet.com> wrote:

> "Shashwat Anand" <anand.shashwat at gmail.com> wrote
>
>> as to match all of them. The task is time-consuming, but with every new
>> test set the exceptions are becoming fewer and fewer. (There are 0.2
>> million such pages.)
>>
>
> One final thing to try is to identify records where you *failed* to find
> a match and write them out to an error file. The error file can then
> be manually processed if need be.
>
> You might also be able to clean up the error file by not writing lines
> that you know to be non-useful. The resultant error file might then
> show up some further patterns that you can exploit.
>
> It's all about eliminating as much manual effort as possible and
> making the manual work that is left over as easy as possible,
> i.e. accept that you won't ever get 100% success and aim to
> minimise the pain as much as possible.
>
>
>
> HTH,
>
>
> --
> Alan Gauld
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
>