[Tutor] extract meaningful data from garbage
Lie Ryan
lie.1296 at gmail.com
Sun Jan 3 10:23:18 CET 2010
On 1/3/2010 4:58 PM, Shashwat Anand wrote:
> I need to extract some meaningful data from garbage.
> Here are four examples. I need to get date, company name and address
> from these.
> For the date I used a regex, but I'm unable to find any definite pattern
> for address and company name
> the format is more or less :
> garbage
> id - date
> garbage
> company name
> garbage
> company address
> garbage
>
> How should I parse info if I'm not certain of any definite rules. This
> is my first time dealing with real-life data.
Other than the "id - date" line, it looks quite difficult to extract the
company names and addresses reliably. Without a definite pattern, that
extraction can only be done on a best-effort basis.
Tip: look for clue keywords. Company names often end with
ltd/sdn/bhd/berhad; a line that starts with "address" is often followed
by the actual address; etc.
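
A rough sketch of the clue-keyword idea, in case it helps. The function
name extract_fields, the two regexes and the sample lines are all made up
for illustration; I am assuming one candidate per line and that the
address sits either on the "address" line itself or on the line right
after it, which your real data may well violate.

```python
import re

# Assumed clue keywords, taken from the tip above; extend as you find more.
COMPANY_SUFFIX = re.compile(r"\b(ltd|sdn|bhd|berhad)\b\.?\s*$", re.IGNORECASE)
# Assumed label format: "address", optionally followed by a colon.
ADDRESS_LABEL = re.compile(r"^address\b\s*:?\s*(.*)$", re.IGNORECASE)

def extract_fields(lines):
    """Best-effort extraction; a field stays None when no clue is found."""
    company = None
    address = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        # First line that ends with a company-like suffix wins.
        if company is None and COMPANY_SUFFIX.search(stripped):
            company = stripped
        match = ADDRESS_LABEL.match(stripped)
        if address is None and match:
            rest = match.group(1).strip()
            if rest:
                # Address is on the same line as the label.
                address = rest
            elif i + 1 < len(lines):
                # Otherwise assume it is on the following line.
                address = lines[i + 1].strip()
    return {"company": company, "address": address}

# Invented sample, not taken from the real data.
sample = [
    "xx##garbage",
    "Acme Trading Sdn Bhd",
    "~~~",
    "Address:",
    "12 Jalan Example, Kuala Lumpur",
    "zz",
]
result = extract_fields(sample)
print(result)
```

Returning None for a missing field, rather than guessing, keeps the
best-effort nature of the result visible to whatever consumes it.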
Tip: this is a good showcase for TDD (test-driven development). Pick
twenty or so cases, extract the information from them by hand, and write
your program to match as many of these test cases as possible. While
extracting by hand you should notice additional patterns that you can
use later when writing your program.
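
One possible shape for such a test harness, as a sketch only. The names
CASES, naive_extract and score are hypothetical, and the two cases are
invented placeholders; you would replace them with your twenty or so
hand-extracted ones. The extractor here is deliberately weak so that the
harness has a failure to report.

```python
# Hand-labelled cases: (raw text, fields extracted by hand).
# Both entries are invented placeholders, not real data.
CASES = [
    ("junk\nAcme Trading Sdn Bhd\njunk", {"company": "Acme Trading Sdn Bhd"}),
    ("junk\nExample Holdings Berhad\njunk", {"company": "Example Holdings Berhad"}),
]

def naive_extract(text):
    """Deliberately weak first attempt: only knows the 'bhd' suffix."""
    for line in text.splitlines():
        if line.strip().lower().endswith("bhd"):
            return {"company": line.strip()}
    return {"company": None}

def score(extract, cases):
    """Run extract over every case; return (number passed, list of failures)."""
    failures = []
    for text, expected in cases:
        got = extract(text)
        if got != expected:
            failures.append((text, expected, got))
    return len(cases) - len(failures), failures

passed, failures = score(naive_extract, CASES)
print(f"{passed}/{len(CASES)} cases pass")
for text, expected, got in failures:
    print("expected", expected, "but got", got)
```

Each failure points at a pattern the extractor does not know yet (here,
the "berhad" suffix), and rerunning the harness after every change shows
whether a new rule broke a case that used to pass.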