John Hunter jdhunter at
Thu Sep 5 19:10:24 CEST 2002

I am writing a Plex lexer/scanner to parse a PDF file.  Here is the
first part, which extracts the streams.

I would like to do this a bit more efficiently, namely, to read the
streams in multicharacter chunks rather than one character at a time.
As it is, the function add_stream has to be called for every character
in the stream.  

Any advice on how to do this?

Here is the code:

from Plex import *

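def add_stream(scanner, text):
    # Called once per matched character while in the 'stream' state.
    scanner.thisStream += text

def end_stream(scanner, text):
    # do something with the stream here
    print 'BeginStream: ', scanner.thisStream, 'EndStream:'
    scanner.thisStream = ''
    scanner.begin('')  # leave the 'stream' state and return to the default state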

lexicon = Lexicon([
  (AnyChar, IGNORE),
  (Bol + Str("stream\n") + Eol, Begin('stream')),
  State('stream', [
    (Bol + Str("endstream") + Eol, end_stream),
    (AnyChar, add_stream),
  ]),
])

filename = "test.pdf"
f = open(filename, "r")
scanner = Scanner(lexicon, f, filename)
scanner.thisStream = ''

while 1:
  token = scanner.read()
  if token[0] is None:
    break
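
To make the goal concrete, here is a rough sketch in plain Python (not
Plex) of the chunked behaviour I am after: whole lines are appended
between the "stream" and "endstream" keywords instead of one character
at a time.  The helper name extract_streams is hypothetical, and the
sketch assumes text input in which each keyword sits alone on its own
line, which real PDF files do not guarantee.

```python
import io

def extract_streams(lines):
    """Collect the text found between 'stream' and 'endstream' lines.

    `lines` is any iterable of text lines, e.g. an open file.
    Assumption: each keyword appears alone on its own line.
    """
    streams = []
    chunks = None  # None means we are currently outside a stream
    for line in lines:
        keyword = line.rstrip("\r\n")
        if chunks is None:
            if keyword == "stream":
                chunks = []  # enter the stream, start collecting lines
        elif keyword == "endstream":
            streams.append("".join(chunks))  # join once per stream
            chunks = None  # back outside the stream
        else:
            chunks.append(line)  # one append per line, not per character
    return streams

# Made-up sample input, only to exercise the function.
sample = io.StringIO(
    u"1 0 obj\nstream\nabc\ndef\nendstream\nendobj\nstream\nxyz\nendstream\n"
)
streams = extract_streams(sample)
```

Collecting the lines in a list and joining them once per stream also
avoids the repeated string concatenation done in add_stream.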

John Hunter
