Searching a binary file -- More concrete instance of generalized searching problem
Samuel Scarano
srs25 at cornell.edu
Tue Jul 25 11:50:37 EDT 2000
Me again. I just thought of a good example that illustrates what I'm
thinking of.
Suppose the string you're matching against comes in incrementally -- as
lines of text to stdin, or perhaps a TCP stream. The total length can be
arbitrarily large, so you don't want to just concatenate incoming data to a
perpetually growing string. And the match may contain the first character,
so you can't just discard old data (e.g., matching 'a.*b' against a string
that starts with 'a' and whose first 'b' is at the billionth position).
I know its *possible* -- at least with "pure" regular expressions -- because
they define regular languages, and membership in a regular language can be
determined by scanning a string from left to right without storing a single
character -- all that must be stored is the state of the DFA corresponding
to the regex.
Mmmm... perhaps I should forget all this just impose a bound on the match
length. But even then, every time you get a new piece of the string in, you
have to re-search that overlap -- that's a *lot* of redundant searching. Too
ugly. A regex engine that saves its state would be very useful I think....
Samuel Scarano wrote:
>
> Interesting problem. I'd like to propose a generalized one which I may be
> facing soon:
>
> Suppose you want to search with an *arbitrary* regular expression. In this
> case, you don't know how long the matching string may be, so you don't know
> how much overlap you need between buffers.
>
> Is it necessary, perhaps, to use a constant buffer-overlap size, thus
> placing a bound on the length of a matched string?
>
> Just to make it harder, suppose the buffers are numerous and arbitrarily
> small -- sometimes smaller than the matched-string length bound.
>
> Is there a way to get the regular expression engine to save its state at the
> end of one buffer and pick up where it left off on the next? I'm almost
> certain that this is possible (at least with the sort of naive DFA-based
> regex implementation that I have in mind), but has such functionality been
> made available in python?
>
> luthi at vaw.baug.ethz.NOSPAM.ch wrote:
> >
> > What is the fastest way to search a binary file for a certain byte pattern
> > (**** in my case)?
> >
> > I came up with this solution, but I guess there is a better way to do
> > it. Thanks for all suggestions.
> >
> > Martin Luethi
> >
> > =====
> >
> > # constants
> > MaxFileSizeInMemory = 10*1000*1000
> >
> > file = open('my_binary_file', 'b')
> >
> > file.seek(0,2) # set the pointer to the end of the file
> > filesize = file.tell() # get the size of the file
> > file.seek(0) # reset the file pointer
> > rex = re.compile(r'[\*]{4}') # the bytes to search for
> > starpointer = [] # position of the '****' in the file
> > oldpos = 0
> > for i in range(filesize/MaxFileSizeInMemory + 1):
> > data = self.file.read(MaxFileSizeInMemory) # read a chunk of bytes
> > m = rex.search(data)
> > while m: # find all '****' in this chunk
> > pos = m.start() - 4 # corrected for the length of '****'
> > incpointer.append(pos + i*MaxFileSizeInMemory)
> > m = rex.search(data, pos + 10)
> > oldpos = pos
> > file.close() # close the file
> >
> > --
> > ============================================================
> > Martin Luethi Tel. +41 1 632 40 92
> > VAW ETH Zuerich
> > CH-8092 Zuerich mail luthi at vaw.baum.ethz.ch
> > Switzerland
> > ============================================================
>
> --
> Samuel R. Scarano "Due to circumstances beyond my
> undergraduate, Cornell University control, I am master of my fate
> <srs25 at cornell.edu> and captain of my soul."
> http://people.cornell.edu/pages/srs25/ -- Ashleigh Brilliant
--
Samuel R. Scarano "Due to circumstances beyond my
undergraduate, Cornell University control, I am master of my fate
<srs25 at cornell.edu> and captain of my soul."
http://people.cornell.edu/pages/srs25/ -- Ashleigh Brilliant
More information about the Python-list
mailing list