Heuristically processing documents
Hendrik van Rooyen
mail at microcorp.co.za
Fri Mar 20 04:34:48 EDT 2009
"MRAB" <google at mrett.plus.com> wrote:
BJörn Lindqvist wrote:
8< ---------------------------
>> For example, to find the email you can use a simple regexp. If there
>> is a match you can be certain that that is the author's email. But what
>> algorithms can you use to figure out the other information?
>>
>Tricky! :-)
>
>How would _you_ recognise them? Have a look at the documents and see if
>you can see a pattern. For example, names and addresses often consist of a
>sequence of words in title case, e.g. "Björn Lindqvist", which might help
>you narrow down the list of possibilities. What do telephone numbers
>look like, etc?
It may help you to think about the problem if you imagine yourself having
to extract the information from documents written in a language that
you do not understand.
An address may be identified by a number in a line (street address or PO box)
that is followed some lines later by another number (zip code).
But this hardly qualifies as an "algorithm".
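
As a rough sketch of the "number, then another number a few lines later" idea, something like the following could flag candidate address blocks. The function name, the max_gap window and the 4-to-6 digit postal code pattern are all assumptions made for illustration; real postal codes vary a lot by country.

```python
import re

HAS_NUMBER = re.compile(r"\d")
# Assumption: a postal code is a run of 4 to 6 digits standing alone.
POSTAL_CODE = re.compile(r"\b\d{4,6}\b")

def find_address_blocks(lines, max_gap=4):
    """Return (start, end) line-index pairs that might bracket an address."""
    candidates = []
    for i, line in enumerate(lines):
        if not HAS_NUMBER.search(line):
            continue
        # Look a few lines further down for something shaped like a postal code.
        for j in range(i + 1, min(i + 1 + max_gap, len(lines))):
            if POSTAL_CODE.search(lines[j]):
                candidates.append((i, j))
                break
    return candidates

doc = ["Koos van der Merwe", "12 Long Street", "Cape Town", "8001", "South Africa"]
print(find_address_blocks(doc))
```

This only produces candidates; anything it returns would still need checking against the other clues.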
A "mailto:" and/or a set of "angle brackets" is a strong clue too...
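
A minimal sketch of using those two clues, assuming a deliberately loose pattern for what an email address looks like (it is nowhere near a full RFC 5322 matcher, and the names below are made up for the example):

```python
import re

# Assumption: a loose "looks like an email address" pattern, not a validator.
ADDRESS = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
MAILTO = re.compile(r"mailto:(" + ADDRESS + r")", re.IGNORECASE)
BRACKETED = re.compile(r"<(" + ADDRESS + r")>")

def email_clues(text):
    """Collect addresses introduced by 'mailto:' or wrapped in angle brackets."""
    found = []
    for pattern in (MAILTO, BRACKETED):
        for match in pattern.finditer(text):
            if match.group(1) not in found:
                found.append(match.group(1))
    return found

print(email_clues("John Brown <john@example.com>, see mailto:jb@example.org"))
```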
I don't have a clue about the name, though: plain title case might
work for "John Brown" but it fails with "Koos van der Merwe".
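
One way to relax the title-case test is to tolerate a few lowercase particles between capitalised words. This is only a sketch: the particle set is a small, incomplete assumption, and the function name is invented for the example.

```python
# Assumption: a small, incomplete set of lowercase name particles.
PARTICLES = {"van", "der", "de", "den", "du", "von", "la", "le", "di", "da"}

def looks_like_name(candidate):
    """True if the words are capitalised, allowing lowercase particles inside."""
    words = candidate.split()
    if len(words) < 2:
        return False
    # The first and last words must start with an uppercase letter.
    if not (words[0][:1].isupper() and words[-1][:1].isupper()):
        return False
    # Words in between may be capitalised or a known particle.
    return all(w[:1].isupper() or w in PARTICLES for w in words[1:-1])

for text in ("John Brown", "Koos van der Merwe", "the quick fox"):
    print(text, looks_like_name(text))
```

Note that "Cape Town" passes this test as well, so it can only narrow the list of possibilities, not pick the name.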
If there is an email addy in the doc, then it might serve as a clue
to where to look - based on the theory that the contact information
would be grouped together.
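
The grouping theory could be tried out with something like this, which returns the lines close to any line containing an email-shaped string. The window size and the loose email pattern are assumptions for the sake of the sketch.

```python
import re

# Assumption: a loose "looks like an email address" pattern, not a validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def lines_near_email(lines, window=3):
    """Return sorted indices of lines within `window` lines of an email."""
    nearby = set()
    for i, line in enumerate(lines):
        if EMAIL.search(line):
            first = max(0, i - window)
            last = min(len(lines), i + window + 1)
            nearby.update(range(first, last))
    return sorted(nearby)

doc = ["Title", "", "text", "text", "text",
       "Koos van der Merwe", "koos@example.com", "12 Long Street"]
print(lines_near_email(doc, window=1))
```

The name and address tests then only need to be run on those lines instead of on the whole document.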
Another clue might be to look for the word "Author" or its
equivalent in a bunch of languages.
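
A sketch of the label lookup, assuming the label sits at the start of a line and is followed by a colon or a dash. The word list is hand-picked and incomplete, and is only an assumption about which languages the documents might be in.

```python
import re

# Assumption: a hand-picked, incomplete list of words meaning "author".
AUTHOR_WORDS = ["author", "auteur", "autor", "autore",
                "författare", "outeur", "verfasser"]
AUTHOR_LABEL = re.compile(
    r"^[ \t]*(?:" + "|".join(AUTHOR_WORDS) + r")[ \t]*[:\-][ \t]*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)

def labelled_authors(text):
    """Return whatever follows an 'Author:' style label on the same line."""
    return [m.group(1).strip() for m in AUTHOR_LABEL.finditer(text)]

print(labelled_authors("Some Title\nAuthor: Koos van der Merwe\nBody text"))
```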
"Tricky" is an understatement.
- Hendrik