andrew-news at andros.org.uk
Fri Sep 15 01:31:02 CEST 2006
John Machin wrote:
> A quick silly question: what is the problem that you are trying to
> solve?
A fair question :-)
The problem may seem a bit strange, but here it is:
I have the ability to query a database in a legacy system and extract
records which match a particular pattern. Specifically, I can perform
queries for records that contain a given search term as a sub-string of
a particular column. The specific column contains an address. This
database can only be accessed through this particular interface (don't
ask why, it's one of the reasons it's a *legacy* system).
I also have access to a list that contains the vast majority (possibly
all) of the addresses stored in the database.
Now I want to issue a series of queries, such that when I combine all
the data returned I have accessed all the records in the database.
However, I want to minimise the total number of queries and also want to
keep the number of records returned by more than one query small.
Now the current approach I use is to divide the addresses I have into
tokens and take the last token in the address (excluding the postal
code). The union of these "last tokens" forms my set of queries. The
last token in the address is typically a county or a town in a UK address.
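To make the current approach concrete, here is a minimal Python sketch of it. The tokenisation is an assumption: it treats each address as comma-separated fields with the postal code as the final field. The function names and the sample addresses are invented for illustration only.

```python
def last_token(address):
    """Return the last field of an address, ignoring the postal code."""
    # Assumption: fields are comma-separated and the postal code is
    # always the final field.
    fields = [f.strip() for f in address.split(",") if f.strip()]
    if len(fields) < 2:
        return None  # nothing left once the postal code is excluded
    return fields[-2]


def query_terms(addresses):
    """The union of the 'last tokens' forms the set of queries."""
    return {t for t in map(last_token, addresses) if t}


# Invented sample data, not taken from the real database.
sample = [
    "1 High Street, Reading, Berkshire, RG1 1AA",
    "22 London Road, Reading, Berkshire, RG1 5AQ",
    "10 Downing Street, London, SW1A 2AA",
]
print(sorted(query_terms(sample)))  # ['Berkshire', 'London']
```

With real data the splitting rule would have to match however the addresses are actually laid out.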
This works, but I was wondering if I could do something more efficient.
The problem is that while the search term "London" matches all the
addresses in London it also returns all the addresses containing "London
Road", and a lot of towns have a London Road. Perhaps I would be better
off searching for "Road", "Street", "Avenue" ....
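One way to compare candidate search terms is to simulate the sub-string query against the address list I already have and pick terms greedily. The sketch below is only a heuristic under stated assumptions: matching is a case-sensitive sub-string test (the legacy system may behave differently), the candidate terms are supplied by hand, and the names and sample addresses are invented.

```python
def greedy_queries(addresses, candidates):
    """Pick search terms greedily until every known address is matched."""
    # Simulate the sub-string search locally: term -> indices of matches.
    matches = {
        t: {i for i, a in enumerate(addresses) if t in a} for t in candidates
    }
    uncovered = set(range(len(addresses)))
    chosen = []
    while uncovered:
        # Prefer the term matching the most still-uncovered addresses;
        # break ties in favour of the term returning fewer records overall.
        best = max(
            candidates,
            key=lambda t: (len(matches[t] & uncovered), -len(matches[t])),
        )
        if not matches[best] & uncovered:
            break  # no candidate matches the remaining addresses
        chosen.append(best)
        uncovered -= matches[best]
    return chosen, uncovered


# Invented sample data, not taken from the real database.
addresses = [
    "1 High Street, Reading, Berkshire, RG1 1AA",
    "22 London Road, Reading, Berkshire, RG1 5AQ",
    "10 Downing Street, London, SW1A 2AA",
    "5 London Road, Bath, Somerset, BA1 6AE",
]
candidates = ["London", "Road", "Street", "Berkshire"]
print(greedy_queries(addresses, candidates))  # (['London', 'Street'], set())
```

On this toy data the greedy choice returns the Downing Street record twice, whereas "Road" plus "Street" would cover all four addresses with no overlap, so the greedy pick is not guaranteed to be the best one.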
It occurred to me that this may be isomorphic to a known problem. However,
given that I want to keep two things small, the problem isn't very well
defined.
The current approach works; I was just musing whether there was a faster
approach, so don't think about it too hard.