Heuristically processing documents

Thu Mar 19 15:24:37 EDT 2009

BJörn Lindqvist wrote:
> I have a large set of documents in various text formats. I know that
> each document contains its author's name, email and phone number.
> Sometimes it also contains the author's home address.
> 
> The task is to find out the name, email and phone for as many documents
> as possible. Since the documents are not in a specific format, you
> have to do a lot of guessing and getting approximate results is fine.
> 
> For example, to find the email you can use a simple regexp. If there
> is a match you can be certain that that is the author's email. But what
> algorithms can you use to figure out the other information?
> 
Tricky! :-)

How would _you_ recognise them? Have a look at the documents and see if
you can see a pattern. For example, names and addresses often consist of a
sequence of words in title case, e.g. "Björn Lindqvist", which might help
you narrow down the list of possibilities. What do telephone numbers
look like, etc?
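
As a rough sketch of the title-case idea, something along these lines
might be a starting point. The function name title_case_runs, the loose
email pattern and the sample text are all made up for illustration, and
the results are only candidates that still need filtering (headings and
place names are title case too):

```python
import re

# Loose email pattern: an assumption for this sketch, not a full
# RFC 5322 matcher.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def title_case_runs(text, min_words=2):
    """Return runs of consecutive title-case words found on one line.

    Hypothetical helper: each run is only a *candidate* name or address.
    """
    runs = []
    for line in text.splitlines():
        current = []
        for raw in line.split():
            # Drop surrounding punctuation before testing the word.
            word = raw.strip(".,;:()\"'")
            if word.isalpha() and word.istitle():
                current.append(word)
            else:
                if len(current) >= min_words:
                    runs.append(" ".join(current))
                current = []
        # A run may end at the end of the line.
        if len(current) >= min_words:
            runs.append(" ".join(current))
    return runs


# Invented sample document, purely for demonstration.
sample = "Written by Björn Lindqvist\nContact: bjorn@example.org\n"

name_candidates = title_case_runs(sample)
email_candidates = EMAIL_RE.findall(sample)
```

Requiring at least two words per run throws away lone capitalised words
such as the first word of a sentence, at the cost of missing
single-word names.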