[Tutor] Processing Gutenberg texts

Daniel Yoo dyoo@hkn.EECS.Berkeley.EDU
Fri, 11 Aug 2000 03:35:41 -0700 (PDT)


On Thu, 10 Aug 2000, Simon Brunning wrote:
> into a monster, and I'd like to simplify things.

Let's try to look at code locally, and see if we can get the code a little
simpler.  However, I have to admit I'm not too good at this stuff -- my
own code bulges out a lot.  You might want to look at your program from
both a low-level and high-level perspective, bounce them around a bit, and
see what you'd like to improve.


>     def importText(self, gutFile):
>         importFile = open(gutFile, 'r')
>         while(1):
>             inLine = importFile.readline()
>             if inLine == '': break # EOF
>             self.storeFragment(inLine)
>         if len(self.textBlocks) == 0: # Empty, end of front-matter not
> found.
>             self.textBlocks.append('No text found.')

Instead of the 'while' loop, using a 'for/in' might be a little easier:

###
def importText(self, gutFile):
  for line in open(gutFile).readlines():
    self.storeFragment(line)
  if len(self.textBlocks) == 0:
    self.textBlocks.append('No text found.')
###

Because of garbage collection, we don't have to worry too much about
explicitly keeping a reference to the open file, so this simplifies the
code a bit.


I see that storeFragment() handles 3 separate cases: capturing front
material, doing regular stuff, and finishing paragraphs. It might be
easier to split off the front material searching from the rest of your
fragment storing, since it seems to be different in spirit from the other
two tasks.  Use function decomposition liberally to make things easy to
read.
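For example, the split might look something like this.  (This is just a
hypothetical sketch -- the helper names captureFrontMaterial() and
finishParagraph(), the inFrontMaterial flag, and the '*END*' tag are all
made up for illustration; adapt them to what your class really does.)

```python
class GutenbergText:
    """Sketch: storeFragment() becomes a small dispatcher, and each of
    the three cases gets a function of its own."""
    END_TAG = '*END*'   # hypothetical: marks the end of front material

    def __init__(self):
        self.inFrontMaterial = 1
        self.textBlocks = []
        self.currentBlock = []

    def storeFragment(self, line):
        # Dispatch: hunting for the end of the front material is
        # different in spirit from the other two tasks, so it's split off.
        if self.inFrontMaterial:
            self.captureFrontMaterial(line)
        elif line.strip() == '':
            self.finishParagraph()
        else:
            self.currentBlock.append(line.strip())

    def captureFrontMaterial(self, line):
        # Ignore everything until we see the end-of-front-matter tag.
        if self.END_TAG in line:
            self.inFrontMaterial = 0

    def finishParagraph(self):
        # A blank line closes off the paragraph we've been collecting.
        if self.currentBlock:
            self.textBlocks.append(' '.join(self.currentBlock))
            self.currentBlock = []
```

Each small function now has one job, and storeFragment() reads like a
table of contents for the three cases.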

From a high-level perspective, what might be making your program a little
long is the line-by-line analysis that is being done.  It might be easier
to split off the chapters and other book sections if you do string
manipulations on the whole text file (read() instead of readlines()).
That way, your program, too, looks at the input in a big way.  *grin*

We can say that a document is made up of chunks.  A chunk is one of:

  front material: everything before that *end* tag.
  sectioning: any one of your tocTriggers.
  regular text: anything else.

You can do this in stages.  First chomp off the front material.  With
everything else, you can start chunking along the sectioning keywords.  
Finally, you can do local adjustments, like string.strip(), to clean the
small things up.
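A rough sketch of those stages (hypothetical: I'm guessing at the '*END*'
tag and at what your tocTriggers look like, and I'm using re.split() with
a capturing group, which keeps the keywords in the result):

```python
import re

def splitIntoSections(text, tocTriggers):
    """Stage 1: chomp off the front material.  Stage 2: chunk along the
    sectioning keywords.  Stage 3: local cleanup with strip()."""
    # Stage 1: everything after the line containing the *END* tag.
    endTag = text.find('*END*')
    if endTag != -1:
        text = text[text.find('\n', endTag) + 1:]
    # Stage 2: a capturing group makes re.split() keep the keywords
    # themselves in the list, so we can scan for them afterwards.
    pattern = '(' + '|'.join([re.escape(t) for t in tocTriggers]) + ')'
    chunks = re.split(pattern, text)
    # Stage 3: strip whitespace and drop empty chunks.
    return [chunk.strip() for chunk in chunks if chunk.strip()]
```

The nice thing is that each stage works on the whole text at once, so
there's no per-line bookkeeping to carry around.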

I made a small chunker that might help.  Don't laugh, it's really bad.

###
import string

def makeChunks(msg, trigger):
  """msg -> list of chunks.  For example:

    makeChunks("This is a short message", " ")   ->
    ['This', ' ', 'is', ' ', 'a', ' ', 'short', ' ', 'message']
  """
  sentinel = '\0'
  msg = string.replace(msg, trigger, sentinel + trigger + sentinel)
  return string.split(msg, sentinel)
###

Ok, it's hideous.  I admit it.  I'm sleepy.  *grin* But perhaps it might
be useful for you.  makeChunks() is a little different from a regular
split() because it keeps the splitting element inside the list, in
preparation for a later scan-through for key phrases (like sectioning
names).  I'm abusi...er... using the null character, because I'm assuming
that it'll never show up naturally in a text file.
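That scan-through afterwards might look something like this.  (A sketch:
findSections() is a name I made up, and I've rewritten makeChunks() with
string methods here so the example runs standalone.)

```python
def makeChunks(msg, trigger):
    # Same sentinel trick as above, written with string methods.
    sentinel = '\0'
    msg = msg.replace(trigger, sentinel + trigger + sentinel)
    return msg.split(sentinel)

def findSections(msg, trigger):
    """Scan the chunk list: whenever a chunk is the trigger itself,
    the chunk right after it is that section's text."""
    chunks = makeChunks(msg, trigger)
    sections = []
    for i in range(len(chunks)):
        if chunks[i] == trigger and i + 1 < len(chunks):
            sections.append(chunks[i + 1].strip())
    return sections
```

Because the trigger survives the split, finding "the text that follows
each section heading" is just a matter of looking one index ahead.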


For your exportHTML(), you should probably split sections of it off into
other functions.  For example, moving the table-of-contents part into a
function of its own will probably make things better.  It doesn't matter
that it just gets called once --- the idea is to help you see lots of
small steps reduced to one large step.
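As a sketch of what I mean (exportTOC() and the anchor naming are made
up -- adapt them to whatever your class really stores):

```python
def exportTOC(sectionTitles):
    """Write just the table of contents.  exportHTML() would call this
    once, but pulling it out makes the big picture easier to read."""
    lines = ['<ul>']
    for i in range(len(sectionTitles)):
        # Hypothetical anchor scheme: each section gets an id like "sec0".
        lines.append('<li><a href="#sec%d">%s</a></li>' % (i, sectionTitles[i]))
    lines.append('</ul>')
    return '\n'.join(lines)
```

Then exportHTML() shrinks to a short sequence of one-line steps, each of
which you can test on its own.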

From a quick grep through, it looks like certain "phrases" are being
repeated --- they might be good candidates for functions, which would
reduce the program's size as well.

I gotta go before I konk out to sleep.  I hope this is somewhat
helpful.  Good luck!