Heuristically processing documents
google at mrabarnett.plus.com
Thu Mar 19 20:24:37 CET 2009
Björn Lindqvist wrote:
> I have a large set of documents in various text formats. I know that
> each document contains its author's name, email and phone number.
> Sometimes it also contains the author's home address.
> The task is to find the name, email and phone for as many documents
> as possible. Since the documents are not in a specific format, you
> have to do a lot of guessing, and getting approximate results is fine.
> For example, to find the email you can use a simple regexp. If there
> is a match you can be certain that it is the author's email. But what
> algorithms can you use to figure out the other information?
How would _you_ recognise them? Have a look at the documents and see if
you can spot a pattern. For example, names and addresses often consist of
a sequence of words in title case, e.g. "Björn Lindqvist", which might
help you narrow down the list of possibilities. What do telephone numbers
look like? And so on for each field you need.
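
As a rough illustration of that approach, here is a minimal sketch in
Python. Everything in it is an assumption to be tuned against the real
documents: the patterns EMAIL_RE and PHONE_RE, the helper names
name_candidates and phone_candidates, and the sample text are all
hypothetical. It only produces candidates; you would still have to rank
or filter them.

```python
import re

# Hypothetical patterns; adjust them after looking at real documents.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# A digit run that may contain spaces, dots, dashes or parentheses.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{6,}\d")


def name_candidates(text, min_words=2, max_words=4):
    """Return runs of consecutive title-case words, e.g. 'Björn Lindqvist'."""
    found = []

    def flush(run):
        # Keep the run only if its length is plausible for a name.
        if min_words <= len(run) <= max_words:
            found.append(" ".join(run))
        run.clear()

    for line in text.splitlines():
        run = []
        for word in line.split():
            token = word.rstrip(".,;:")
            if token.isalpha() and token.istitle():
                run.append(token)
                if token != word:  # trailing punctuation ends the run
                    flush(run)
            else:
                flush(run)
        flush(run)  # a run never continues across lines
    return found


def phone_candidates(text, min_digits=7):
    """Return matches of PHONE_RE that contain at least min_digits digits."""
    return [
        m.group()
        for m in PHONE_RE.finditer(text)
        if sum(c.isdigit() for c in m.group()) >= min_digits
    ]


sample = (
    "Author: Björn Lindqvist\n"
    "Email: bjorn@example.org\n"
    "Phone: +46 8 123 456 78\n"
)
print(EMAIL_RE.findall(sample))
print(name_candidates(sample))
print(phone_candidates(sample))
```

Title-case runs will also pick up headings, place names and sentence
openers, so a next step could be to prefer candidates that appear close
to the email address or the phone number.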