Buffer pair for lexical analysis of raw binary data

Sat Jun 27 12:02:43 EDT 2009

Partly as an educational exercise, and partly for its practical
benefit, I'm trying to pick up a programming project from where
I left off in 2001.  It implemented in slightly generalised form
the "buffer pair" scheme for lexical analysis described on pp.
88--92 of Aho et al., /Compilers: Principles, Techniques and 
Tools/ (1986). (I'm afraid I don't have a page reference for the
2007 second edition.  Presumably it's also in Knuth somewhere.)

Documentation for one of the C++ header files describes it thus
(but I never quite got the hang of C++, so some of the language-
specific details may be very poorly conceived):

"An <ipfile> object incorporates a handle to a file, opened in 
read-only mode, and a buffer containing (by default) raw binary
data from that file. The constructor also has an option to open
a file in text mode.

The buffer may, optionally, consist of several segments, linked
to one another in cyclic sequence. The number of segments is a
constant class member, nblocks (1 <= nblocks <= 32,767). A second
constant class member, block (1 <= block <= 32,767) gives the size
of each of the segments in bytes.

The purpose of creating a buffer in cyclically linked segments
is to allow reference to the history of reading the file, even
though it is being read sequentially. The bare class <ipfile> 
does not do this itself, but is designed so that classes derived
from it may incorporate one or more pointers to parts of the buffer
that have already been read (assuming these parts have not yet been
overwritten).

If there were only one segment, the length of available history
would periodically be reduced to zero, when the buffer is re-
freshed. In general, the available history occupies at least 
a fraction (nblocks - 1)/nblocks of a full buffer."

Aho et al. describe the scheme thus (p. 90):

"Two pointers to the input buffer are maintained.  The string
of characters between the two pointers is the current lexeme.
Initially, both pointers point to the first character of the
next lexeme to be found.  One, called the forward pointer, scans
ahead until a match for a pattern is found.  Once the next lexeme
is determined, the forward pointer is set to the character at
its right end.  After the lexeme is processed, both pointers
are set to the character immediately past the lexeme."

[There follows a description of the use of "sentinels" to test
efficiently for pointers moving past the end of input to date.]

I seem to remember (but my memory is still very hazy) that there
was some annoying difficulty in coding the raw binary input file
reading operation in C++ in an implementation-independent way;
and I'm reluctant to go back and perhaps get bogged down again
in whatever way I got bogged down before; so I would prefer to
use Python for the whole thing, if possible (either using some
existing library, or else by recoding it all myself in Python).

Does some Python library already provide some functionality like
this?  (It's enough to do it with nblocks = 2, as in Aho et al.)

If not, is this a reasonable thing to try to program in Python?
(At the same time as learning the language, and partly as a
fairly demanding exercise intended to help me to learn it.)

Or should I just get my hands dirty with some C++ compiler or
other, and get my original code working on my present machine
(possibly in ANSI C instead of C++), and call it from Python?

-- 
Angus Rodgers