[Tutor] extract meaningful data from garbage

spir denis.spir at free.fr
Sun Jan 3 18:47:52 CET 2010

Shashwat Anand dixit:

> @Alan, @Lie thanks
> The approach I am taking right now is to take some test cases and
> create rules for them. Later on, after expanding the cases, there arose
> some cases which didn't follow the earlier pattern, so I tweaked some rules
> to match all of them. The task is time-consuming, but with every new
> test set the exceptions are becoming fewer and fewer. (There are 0.2 million
> such pages.)
> PS. The task is to create a trademark-database which stores ID, company
> name, date, address, and trademarks from the original set and later matches
> with the given trademarks to disqualify similar trademarks.

Sometimes it is worthwhile to note patterns for the parts of the source that are not to be kept (garbage), especially if you can find patterns for the start or end of the garbage parts.
E.g. if the end of the address is hard to detect, check whether it is easier to find a pattern for the start of the garbage that follows it.
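
A rough sketch of that idea in Python, only to illustrate it. I don't know what your pages look like, so the "Address:" label, the garbage markers, and the sample text are all made up, and extract_address is just a name I picked; you would substitute whatever patterns your real data shows.

```python
import re

# Hypothetical label that introduces the address (an assumption, not from real data).
ADDRESS_START = re.compile(r"^Address:\s*", re.MULTILINE)

# Hypothetical patterns for the START of a garbage block: a page marker,
# a rule of dashes, or a "Printed on" line. Real pages need their own list.
GARBAGE_START = re.compile(r"^(?:Page \d+|-{5,}|Printed on\b)", re.MULTILINE)


def extract_address(text):
    """Return the text between the address label and the next garbage marker.

    Returns None if no address label is found. If no garbage marker follows,
    everything up to the end of the text is taken as the address.
    """
    start = ADDRESS_START.search(text)
    if start is None:
        return None
    # Instead of guessing where the address ends, look for where garbage begins.
    end = GARBAGE_START.search(text, start.end())
    stop = end.start() if end else len(text)
    # Collapse line breaks and runs of whitespace inside the address.
    return " ".join(text[start.end():stop].split())


sample = """Address: 12 Example Road,
Some Town 400001
----------
Page 3
"""

print(extract_address(sample))
```

The point is that the address itself never has to be described by a pattern; only the label before it and the garbage after it do. When a new kind of garbage shows up in a later test set, it becomes one more alternative in GARBAGE_START rather than a change to the address rule.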


la vita e estrany

