[Tutor] extract meaningful data from garbage

ALAN GAULD alan.gauld at btinternet.com
Sun Jan 3 17:02:54 CET 2010


The other advantage of an error file is that if 
your rules become very complex it's a lot faster 
to only apply them to the entries that didn't 
match the simpler rules.

In other words instead of applying a complex 
set of rules to every one of 2 million entries 
you might only have to apply them to 
200 thousand... much faster to process.


You might well wind up with several processing 
runs before spitting out the final human 
readable error file for hand processing.
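
Here is a minimal sketch of that staged approach. The record format 
("name: value") and both regular expressions are made up for 
illustration, as are the names run_pass, process and errors.txt, so 
treat it as one possible shape rather than a finished solution.

```python
import re

# Hypothetical rules: the "name: value" record format is an assumption
# made for illustration, not the original poster's data.
SIMPLE_RULE = re.compile(r"^(?P<name>\w+):\s*(?P<value>\d+)$")
COMPLEX_RULE = re.compile(r"^\W*(?P<name>[A-Za-z_]\w*)\W+(?P<value>\d+)\D*$")


def run_pass(records, rule):
    """Apply one rule; return (parsed, failed) so failures feed the next pass."""
    parsed, failed = [], []
    for record in records:
        match = rule.match(record.strip())
        if match:
            parsed.append((match.group("name"), int(match.group("value"))))
        else:
            failed.append(record)
    return parsed, failed


def process(records, error_path="errors.txt"):
    """Cheap rule first, costly rule only on the leftovers, the rest to a file."""
    parsed, leftovers = run_pass(records, SIMPLE_RULE)
    more, errors = run_pass(leftovers, COMPLEX_RULE)
    with open(error_path, "w") as error_file:
        for record in errors:
            if record.strip():  # skip lines known to be useless (blank here)
                error_file.write(record.rstrip("\n") + "\n")
    return parsed + more, errors
```

Each extra rule set becomes one more run_pass call on the previous 
pass's failures, so the expensive rules only ever see the records 
that the cheaper ones could not handle.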

Alan Gauld
Author of the Learn To Program website
http://www.alan-g.me.uk/





________________________________
From: Shashwat Anand <anand.shashwat at gmail.com>
To: Alan Gauld <alan.gauld at btinternet.com>
Cc: tutor at python.org
Sent: Sunday, 3 January, 2010 15:33:33
Subject: Re: [Tutor] extract meaningful data from garbage

I am doing almost the same thing, i.e. giving the values left unparsed a certain name, 'NIL', and currently I'm redirecting the output to a text file. Searching for 'NIL' tells me where my match failed, although writing those records separately to a different file didn't occur to me. And yes, the job is to reduce manual work as much as possible; I've got it now. Thanks for the help :)

~Shashwat


On Sun, Jan 3, 2010 at 8:24 PM, Alan Gauld <alan.gauld at btinternet.com> wrote:

"Shashwat Anand" <anand.shashwat at gmail.com> wrote
>
>> as to match all of them. The task is time-consuming but with every new
>> test set the exceptions are becoming fewer and fewer. (There are .2 million
>> such pages)
>
> One final thing to try is to identify records where you *failed* to find
> a match and write them out to an error file. The error file can then
> be manually processed if need be.
>
> You might also be able to clean up the error file by not writing lines
> that you know to be non-useful. The resultant error file might then
> show up some further patterns that you can exploit.
>
> It's all about eliminating as much manual effort as possible and
> making the manual work that is left over as easy as possible,
> i.e. accept that you won't ever get 100% success and aim to
> minimise the pain as much as possible.
>
> HTH,
>
> -- 
> Alan Gauld
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/