Regular Expression AND mach

Fri Mar 26 03:47:47 EST 2004

wcarus at comcast.net (Win Carus) wrote in message news:<fc91898a.0403250748.5ceb5a41 at posting.google.com>...
> michael at foord.net (Fuzzyman) wrote in message news:<8089854e.0403220021.db39215 at posting.google.com>...
> > "Robert Brewer" <fumanchu at amor.org> wrote in message news:<mailman.194.1079807509.742.python-list at python.org>...
> > > Fuzzyman wrote:

[snip..]

> > > 
> > 
> > Thanks for your reply, it was both helpful and interesting.
> > Unfortunately it only confirmed my suspicion that what I wanted to do
> > wasn't possible :-(
> > 
> > The database 'module' I'm using (which is basically fine for my other
> > purposes) does searches through it's records using regular
> > expressions. A full search through 1800 records (each record can be a
> > couple of k of text) takes about 0.2 seconds - which is fine.... but
> > doing 4 or 5 searches and comparing results *isn't* (0.2 seconds delay
> > is ok - 0.8-1.0 seconds isn't).... So I'm currently just searching for
> > the longest word - KirbBase then returns the full *text* of each song
> > containing that word... and I'm just checking each song to see if it
> > has the other words :-)
> > 
> > This works fine, isn't noticeably slow, but isn't as elegant as I'd
> > hoped...
> > 
> > Regards,
> > 
> > 
> > Fuzzy
> > 
> > http://www.voidspace.org.uk/atlantibots/pythonutils.html
> > 
> > > 
> > > HTCYTIRABM!
> > > 
> > > Robert Brewer
> > > MIS
> > > Amor Ministries
> > > fumanchu at amor.org
> 
> Hi Fuzzyman.
> 
> The heuristic solution you've come up with (i.e, searching using the
> longest term and then filtering/ranking the results against the
> remaining terms) is in fact quite well founded in practice.  Many
> years ago, Miller and Newman determined that there are two
> length-frequency statistical patterns in written English, one for
> content words and another for function words. The content-word pattern
> is regular and demonstrates a very strong inverse relationship between
> word length and word frequency (i.e., the longer the word, the less
> frequent). There are, of course, many exceptions, but the pattern is
> very strong; and there are interesting information theoretic arguments
> why this makes sense (if you accept that human languages operate to
> minimize the costs of communication). In the absence of statistics
> collected on your specific body of text, this is a very useful crutch.
> 
> Miller, George A. and Edwin B. Newman. 1958. "Tests of a Statistical
> Explanation of the Rank-Frequency Relation for Word in Written
> English." American Journal of Psychology (71): 209-258.
> 
> Miller, George A., E. B. Newman, and E. A. Friedman. 1958.
> "Length-Frequency Statistics for Written English." Information and
> Control (1): 370-389.
> 
> On the other hand, it is also possible to (a) identify the full set of
> candidate search terms available (this would allow you to fail quickly
> on search terms that can't be found); and (b) determine their
> frequencies (this would allow you to search for terms in inverse
> frequency, if that's what you really want). Creating an inverted index
> is, of course, a possibility, but text search is so fast and,
> furthermore, you don't have the complexities of creating and
> maintaining the index itself.  Wu and Manber, the developers of the
> 'agrep' tool (which uses some really fancy and powerful partial
> matching techniques and is used by the Glimpse and Webglimpse search
> engines), have argued precisely this point.
> 
> Sun Wu and Udi Manber: Fast Text Searching Allowing Errors.
> Communications of the ACM, pp. 83-91, Vol. 35, No. 10, Oct. 1992, USA.
> 
> Sun Wu and Udi Manber: AGREP - A Fast Approximate Pattern-matching
> Tool. Proceedings of the Winter 1992 USENIX Conference San Francisco,
> USA, 20.-24. Jan. 1992, pp. 153-162, Berkeley, USA, 1991.
> 
> Udi Manber and Sun Wu: Approximate Pattern Matching. BYTE pp. 281-292,
> November 1992.
> 
> There is a Python port of agrep available as a module called 'agrepy'
> at the following site:
> 
> http://www.bio.cam.ac.uk/~mw263/pyagrep.html
> 
> As noted in an earlier post, regexes alone can't do what you want.
> There are other techniques that might do what you're looking for, for
> example, Aho-Corasick (and some other) automata.  Aho-Corasick
> automata are able to 'parallelize' pattern matching so that multiple
> strings can be matched against a text in a single pass. Aho-Corasick
> automata are easy to implement in both their deterministic and
> non-deterministic forms in Python.
> 
> Aho, A., and Corasick, M. Efficient String Matching: An Aid to
> Bibliographic Search. Commun. ACM 18, 6 (June 1975), 333--340.
> 
> http://www-sr.informatik.uni-tuebingen.de/~buehler/AC/html/Automaton.html
> 
> Good luck,
> 
> Win

Interesting - and thanks for your help.

Regards,

Fuzzy

http://www.voidspace.org.uk/atlantibots/pythonutils.html