[Tutor] extract meaningful data from garbage

Shashwat Anand anand.shashwat at gmail.com
Sun Jan 3 12:46:21 CET 2010


@Alan, @Lie: thanks.
The approach I am taking right now is to pick some test cases and create
rules for them. As I expanded the set, some cases turned up that didn't
follow the earlier patterns, so I tweaked the rules to match all of them.
The task is time-consuming, but with every new test set the exceptions are
becoming fewer. (There are 0.2 million such pages.)

PS. The task is to create a trademark database that stores the ID, company
name, date, address, and trademarks from the original set, and later compares
the given trademarks against it to disqualify similar trademarks.
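
For the later matching step, a minimal sketch of one way to flag similar
trademarks. It assumes plain string similarity from the standard library's
difflib is good enough, which may well not hold for real trademark
comparison. The function name, the sample names and the 0.8 cutoff are all
made up for illustration.

```python
import difflib


def similar_trademarks(candidate, registered, cutoff=0.8):
    """Return registered names whose spelling is close to the candidate.

    The cutoff is an arbitrary starting point, not a tested value.
    """
    # Compare case-insensitively, but report the original spelling.
    by_lower = {name.lower(): name for name in registered}
    hits = difflib.get_close_matches(
        candidate.lower(), list(by_lower), n=5, cutoff=cutoff
    )
    return [by_lower[hit] for hit in hits]


# Hypothetical data, not taken from the real pages.
REGISTERED = ["Sunrise Foods", "Blue Ocean", "Golden Dragon"]

if __name__ == "__main__":
    print(similar_trademarks("Sunrize Foods", REGISTERED))
    print(similar_trademarks("Completely Different", REGISTERED))
```

A candidate that returns a non-empty list would then be reviewed or
disqualified; the cutoff would need tuning against known cases.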

On Sun, Jan 3, 2010 at 2:53 PM, Lie Ryan <lie.1296 at gmail.com> wrote:

> On 1/3/2010 4:58 PM, Shashwat Anand wrote:
>
>> I need to extract some meaningful data from garbage.
>> Here are four examples. I need to get the date, company name and address
>> from these.
>> For the date I used a regex, but I'm unable to find any definite pattern
>> for the address and company name.
>> The format is more or less:
>> garbage
>> id - date
>> garbage
>> company name
>> garbage
>> company address
>> garbage
>>
>> How should I parse the info if I'm not certain of any definite rules? This
>> is my first time dealing with real-life data.
>>
>
> Other than the "id - date" line, it seems quite difficult to reliably
> extract the company names and addresses; extracting them will have to be
> done on a best-effort basis.
>
> Tips: look for clue keywords; company names often end with
> ltd/sdn/bhd/berhad; lines that start with "address" are often followed by
> the actual address; etc.
>
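
The clue-keyword tips above could be sketched roughly like this. The
DD/MM/YYYY date format, the sample page and all names in the code are
assumptions made for illustration; the real pages may look quite different.

```python
import re

# Assumed layout of the "id - date" line: digits, a dash, then DD/MM/YYYY.
ID_DATE_RE = re.compile(r"^\s*(\d+)\s*-\s*(\d{2}/\d{2}/\d{4})\s*$")
# Company names often end with one of these words.
COMPANY_SUFFIX_RE = re.compile(r"\b(ltd|sdn|bhd|berhad)\.?\s*$", re.IGNORECASE)
# Address lines often start with the word "address".
ADDRESS_RE = re.compile(r"^\s*address\s*[:\-]?\s*(.*)$", re.IGNORECASE)


def extract(text):
    """Best-effort extraction; fields that cannot be found stay None."""
    record = {"id": None, "date": None, "company": None, "address": None}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        m = ID_DATE_RE.match(line)
        if m and record["id"] is None:
            record["id"], record["date"] = m.group(1), m.group(2)
            continue
        if record["company"] is None and COMPANY_SUFFIX_RE.search(line):
            record["company"] = line.strip()
            continue
        m = ADDRESS_RE.match(line)
        if m and record["address"] is None:
            # The address may sit on the same line or on the next one.
            rest = m.group(1).strip()
            if not rest and i + 1 < len(lines):
                rest = lines[i + 1].strip()
            record["address"] = rest or None
    return record


# Hypothetical page, invented for this sketch.
SAMPLE = """\
xx##junk@@
12345 - 03/01/2010
~~more junk~~
Acme Trading Sdn Bhd
%%%
Address: 12 Jalan Example, Kuala Lumpur
zzz
"""

if __name__ == "__main__":
    print(extract(SAMPLE))
```

Fields that stay None mark the pages that need a new rule or a manual look.
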
> Tips: this is a good showcase for TDD; pick twenty-or-so cases and manually
> extract the information and write your program to match as much of these
> test cases as possible (while manually extracting you should be able to
> notice additional patterns that you can use later on while writing your
> program).
>
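
The test-case loop described above might look something like this in its
simplest form: hand-labelled cases, one extractor, and a count of how many
cases pass. The cases, the date format and the function names are invented
for illustration; a real run would use the manually extracted pages.

```python
import re

# Assumed date format (DD/MM/YYYY); the real data may differ.
DATE_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")


def extract_date(text):
    """Return the first DD/MM/YYYY date in the text, or None."""
    m = DATE_RE.search(text)
    return m.group(1) if m else None


# Hand-labelled cases: (raw text, expected date). All hypothetical.
CASES = [
    ("junk\n101 - 03/01/2010\njunk", "03/01/2010"),
    ("@@\n102 - 15/12/2009\n##", "15/12/2009"),
    ("junk\n103 - 3 Jan 2010\njunk", "3 Jan 2010"),  # no rule for this yet
]


def run_cases(extractor, cases):
    """Return (number of passing cases, list of failures)."""
    failures = []
    for text, expected in cases:
        got = extractor(text)
        if got != expected:
            failures.append((text, expected, got))
    return len(cases) - len(failures), failures


if __name__ == "__main__":
    passed, failures = run_cases(extract_date, CASES)
    print(f"{passed}/{len(CASES)} cases pass")
    for text, expected, got in failures:
        print(f"expected {expected!r}, got {got!r}")
```

Each failure points at a pattern that still needs a rule; rerunning the
whole set after every tweak shows whether a new rule broke an older case.
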