Algorithm Question

Thu Sep 14 19:31:02 EDT 2006

John Machin wrote:
> A quick silly question: what is the problem that you are trying to
> solve?

A fair question :-)

The problem may seem a bit strange, but here it is:

I have the ability to query a database in a legacy system and extract 
records which match a particular pattern. Specifically, I can perform 
queries for records that contain a given search term as a sub-string of 
a particular column. The specific column contains an address. This 
database can only be accessed through this particular interface (don't 
ask why, it's one of the reasons it's a *legacy* system).

I also have access to a list that contains the vast majority (possibly 
all) of the addresses which are stored in the database.

Now I want to issue a series of queries, such that when I combine all 
the data returned I have accessed all the records in the database. 
However, I want to minimise the total number of queries and also want to 
keep the number of records returned by more than one query small.

Now the current approach I use is to divide the addresses I have into 
tokens and take the last token in the address (excluding the postal 
code). The union of these "last tokens" forms my set of queries. The 
last token in the address is typically a county or a town in a UK address.
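
A minimal sketch of the "last token" approach described above, in Python. 
It assumes addresses are comma-separated strings and that a trailing 
postal code can be recognised with a rough UK postcode pattern; the names 
last_token and query_terms and the regular expression are illustrative 
assumptions, not part of the real system.

```python
import re

# Rough UK postcode pattern (an assumption; not a full validator).
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$", re.IGNORECASE)

def last_token(address):
    """Return the last comma-separated token, skipping a trailing postcode."""
    tokens = [t.strip() for t in address.split(",") if t.strip()]
    if tokens and POSTCODE_RE.match(tokens[-1]):
        tokens.pop()  # drop the postal code
    return tokens[-1] if tokens else None

def query_terms(addresses):
    """Union of the last tokens over all known addresses."""
    return {t for t in map(last_token, addresses) if t}

sample = [
    "1 High Street, Reading, Berkshire, RG1 2AB",
    "5 London Road, Bath, BA1 1AA",
    "10 Downing Street, London, SW1A 2AA",
]
terms = query_terms(sample)
```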

This works, but I was wondering if I could do something more efficient. 
The problem is that while the search term "London" matches all the 
addresses in London it also returns all the addresses containing "London 
Road", and a lot of towns have a London Road. Perhaps I would be better 
off searching for "Road", "Street", "Avenue" ....
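
To see how bad the "London Road" effect is for a given set of search 
terms, one could replay the queries against the local address list. The 
sketch below assumes the legacy interface does a plain, case-insensitive 
sub-string match (that is a guess about its behaviour); coverage_report is 
a hypothetical helper name.

```python
def coverage_report(addresses, terms):
    """Return (missed, duplicated) addresses for a set of search terms.

    missed:     addresses matched by no term
    duplicated: addresses matched by more than one term
    """
    lowered_terms = [t.lower() for t in terms]
    missed, duplicated = [], []
    for address in addresses:
        haystack = address.lower()
        # Count how many queries would return this record.
        hits = sum(1 for term in lowered_terms if term in haystack)
        if hits == 0:
            missed.append(address)
        elif hits > 1:
            duplicated.append(address)
    return missed, duplicated

sample = [
    "10 Downing Street, London",
    "5 London Road, Bath",
    "1 High Street, Reading",
]
missed, duplicated = coverage_report(sample, {"London", "Bath"})
```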

It occurred to me that this may be isomorphic to a known problem. However, 
given that I want to keep two things small, the problem isn't very well 
defined.
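
One way to make it concrete, offered only as a sketch: minimising the 
number of queries that together return every record looks like the 
classic set cover problem, and the second goal can be folded in as a 
penalty on records that a query would return again. The greedy heuristic 
below is a standard approximation for set cover, not an exact solution; 
greedy_queries, overlap_penalty and the case-insensitive sub-string match 
are all assumptions made for illustration.

```python
def greedy_queries(addresses, candidates, overlap_penalty=0.5):
    """Greedily pick search terms until every known address is covered.

    Each round picks the term with the best score, where
    score = newly covered addresses - overlap_penalty * re-fetched addresses.
    Returns (chosen terms in order, set of covered address indices).
    """
    lowered = [a.lower() for a in addresses]
    # matches[term] = indices of the addresses that contain the term
    matches = {
        t: {i for i, a in enumerate(lowered) if t.lower() in a}
        for t in sorted(candidates)  # sorted for deterministic tie-breaking
    }
    covered, chosen = set(), []
    while len(covered) < len(addresses):
        # Only consider terms that still return something new.
        useful = [t for t in matches if matches[t] - covered]
        if not useful:
            break  # remaining addresses match no candidate term
        best = max(
            useful,
            key=lambda t: len(matches[t] - covered)
            - overlap_penalty * len(matches[t] & covered),
        )
        chosen.append(best)
        covered |= matches[best]
    return chosen, covered

sample = [
    "10 Downing Street, London",
    "221B Baker Street, London",
    "5 London Road, Bath",
    "7 London Road, Leeds",
    "3 Mill Lane, Bath",
]
chosen, covered = greedy_queries(
    sample, ["London", "Bath", "Leeds", "Road", "Street", "Lane"]
)
```

Raising overlap_penalty favours terms that avoid re-fetching records at 
the cost of more queries; setting it to 0 gives the plain greedy set 
cover heuristic.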

The current approach works; I was just musing whether there was a faster 
approach, so don't think about it too hard.

- Andrew