[Tutor] extract meaningful data from garbage
alan.gauld at btinternet.com
Sun Jan 3 09:56:22 CET 2010
"Shashwat Anand" <anand.shashwat at gmail.com> wrote
> here are the examples : http://codepad.org/wF8APZV3
>> I need to extract some meaningful data from garbage.
>> How should I parse info if I'm not certain of any definite rules. This is
>> my first time dealing with real-life data.
Unfortunately, to parse it you will need to define a set of rules.
The company name seems to consistently follow a line like
1511261 - 08/12/2006
So that should be relatively easy to extract. However the
address data seems much more random.
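As a rough sketch of that first rule (not tested against your real data), something like the following should pull out the company names. The regular expression is my guess at the header format based on the one line quoted above, and the function name is just made up for illustration:

```python
import re

# Assumed header shape, guessed from "1511261 - 08/12/2006":
# a run of digits, a dash, then a date written as nn/nn/nnnn.
HEADER = re.compile(r"^\s*\d+\s*-\s*\d{2}/\d{2}/\d{4}\s*$")

def company_names(lines):
    """Return the first non-blank line found after each header line."""
    names = []
    expecting_name = False
    for line in lines:
        if HEADER.match(line):
            expecting_name = True
        elif expecting_name and line.strip():
            names.append(line.strip())
            expecting_name = False
    return names
```

You would call it with the lines of the file, e.g. company_names(open(filename).read().splitlines()). If real headers vary in the number of digits in the date or the spacing around the dash, the pattern will need loosening.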
Also you don't say how you want to treat the alternative
names/addresses (e.g. "trading as...")
As a first attempt the address follows immediately
after the company name except
1) when the next line begins with "trading as" or
2) the next line begins with "(" in which case the address
follows the closing ")"
Those two rules are sufficient for the 4 examples you posted.
You may have other cases where they break down.
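Here is a minimal sketch of those two rules, again only a starting point. It takes the lines that follow a company name and returns what it thinks is the address. I have assumed that a "trading as" name fits on a single line, that a bracketed section may run over several lines, and that the address ends at a line starting with MANUFACTURER (which is where it stops in the samples you posted). The function name and the stop_prefix parameter are my own inventions:

```python
def address_lines(after_name, stop_prefix="MANUFACTURER"):
    """Return the address lines from the lines that follow a company name."""
    lines = [ln.strip() for ln in after_name if ln.strip()]
    i = 0
    if lines and lines[0].lower().startswith("trading as"):
        # Rule 1: assume the alternative name fits on one line and skip it.
        i = 1
    elif lines and lines[0].startswith("("):
        # Rule 2: move forward to the line holding the closing ")".
        while i < len(lines) and ")" not in lines[i]:
            i += 1
        if i < len(lines):
            # Keep any text that shares a line with the closing ")".
            rest = lines[i].split(")", 1)[1].strip()
            lines[i] = rest
            if not rest:
                i += 1
    address = []
    for ln in lines[i:]:
        # Assumed end marker, taken from the posted samples.
        if ln.startswith(stop_prefix):
            break
        address.append(ln)
    return address
```

If a bracket is never closed the function returns an empty list rather than guessing, which at least makes the odd records easy to spot and inspect by hand.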
The address seems to consistently stop on the line above
MANUFACTURER...
But again other data samples may show cases where that breaks
down. You will need to create more rules to cover those cases.
Unfortunately the address data itself is not consistent,
which makes it difficult to define a rule to recognise it in its
own right.

HTH

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/