[Tutor] extracting matches by paragraph

Wed Oct 11 12:06:46 CEST 2006

Thomas A. Schmitz wrote:
> Please excuse this long mail. I have read several tutorials and googled for
> three days, but haven't made any real progress on this question, probably
> because I'm an absolute novice at python. I'd be very grateful for some help.
> 
> 1. My problem:
> 
> I have several files in a structured database format. They
> contain entries like this:
> 
> Type de notice : monographie
> Auteur(s) : John Doe
> Titre(s) : Argl bargl
> Publication : Denver, University of Colorado Press, 1776
> 
> Type de notice : article
> Auteur(s) : Richard Doe
> Titre(s) : wurgl burgl
> 
> Type de notice : recueil
> Titre(s) : orgl gorgl
> 
> I want to translate this into a BibTeX format. My approach was to read the
> file in by paragraphs, then extract the values of the fields that interest me
> and write these values to another file. I cannot go line by line since I want
> to reuse, e.g., the values of the "Auteur(s)" and "Titre(s)" fields to generate
> a key for every item, in the form of "doeargl" or "doewurgl" (via the split and
> join functions). The problem is that not every entry contains every field (in
> my example, #3 doesn't have an author), so I guess I need to test for the
> existence of these fields before I can use their values.
> 
> 2. The approach:
> 
> There is code here
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/408996/
> which allows reading a file by paragraphs. I copied this to my script:
> 
> class FileIterator(object):
>     """ A general purpose file object iterator cum
>     file object proxy """
> 
>     def __init__(self, fw):
>         self._fw = fw
> 
>     def readparagraphs(self):
>         """ Paragraph iterator """
> 
>         # This re-uses Alex Martelli's
>         # paragraph reading recipe.
>         # Python Cookbook 2nd edition 19.10, Page 713
>         paragraph = []
>         for line in self._fw:
>             if line.isspace():
>                 if paragraph:
>                     yield "".join(paragraph)
>                     paragraph = []
>             else:
>                 paragraph.append(line)
>         if paragraph:
>             yield "".join(paragraph)
> 
> When I now run a very basic test:
> 
> for item in iter.readparagraphs():
>      print item
> 
> The entire file is reprinted paragraph by paragraph, so this code appears to
> work.

I would take out the join in this, at least, and return a list of lines. 
You don't really have a paragraph, you have structured data. There is 
no need to throw away the structure.

It might be even more useful to return a dictionary that maps field 
names to values. Also, there doesn't seem to be any reason to make 
FileIterator a class; you can just use a generator function (Dick Moores 
take notice!):

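def readparagraphs(fw):
    """Yield one dict of field names to values per blank-line-separated entry."""
    data = {}
    for line in fw:
        if line.isspace():
            if data:
                yield data
                data = {}
        else:
            # Split on the first ' : ' only; strip the trailing newline.
            key, value = line.split(' : ', 1)
            data[key] = value.strip()
    if data:
        yield data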

Now you don't need a regexp, you have usable data directly from the 
iterator.
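
To illustrate, here is a minimal sketch of such a generator run against the 
three sample entries from the question. The name read_records, the use of 
io.StringIO to stand in for the real file, and the choice to strip whitespace 
and split on the first ' : ' only are my assumptions. It is written for 
Python 3, so print is a function.

```python
import io

# The three sample entries from the question, standing in for a real file.
SAMPLE = """\
Type de notice : monographie
Auteur(s) : John Doe
Titre(s) : Argl bargl
Publication : Denver, University of Colorado Press, 1776

Type de notice : article
Auteur(s) : Richard Doe
Titre(s) : wurgl burgl

Type de notice : recueil
Titre(s) : orgl gorgl
"""


def read_records(lines):
    """Yield one dict per blank-line-separated entry (hypothetical helper)."""
    data = {}
    for line in lines:
        if line.isspace():
            if data:
                yield data
                data = {}
        else:
            # Split on the first ' : ' only, so a value may itself contain ' : '.
            key, value = line.split(' : ', 1)
            data[key.strip()] = value.strip()
    if data:
        yield data


records = list(read_records(io.StringIO(SAMPLE)))
for record in records:
    # .get() returns None for a missing field instead of raising KeyError.
    print(record['Type de notice'], record.get('Auteur(s)'))
```

A line that contains no ' : ' at all would raise ValueError in this sketch, 
so real data may need an extra check.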

> I can also match the first line of every paragraph like so:
> 
> reBook = re.compile('^Type de notice : monographie')
> for item in iter.readparagraphs():
>     m1 = reBook.match(item)
>     if m1:
>         print "@Book{,"
> 
> this will print a line @Book{, for every "monographie" in the database -- a
> good start, I thought!
> 
> 3. The problem that's driving me insane
> 
> But as soon as I try to match anything inside of the paragraph:
> 
> reAuthor = re.compile('^Auteur\(s\) : (?P<author>.+)$')
>     m2 = reAuthor.match(item)
>     if m2:
>         author = m2.group('author')
>         print "author = {%s}," % author
> 
> I get no matches at all. I have tried to remove the ^ and the $ from the
> regex, or to add the "re.DOTALL" flag, but to no avail.

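You need re.MULTILINE to make ^ and $ match at the start and end of each 
line; re.DOTALL only affects whether . matches newlines. You also need 
search() instead of match(), because match() only tries the pattern at the 
very beginning of the string, even with re.MULTILINE.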
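
The effect of the flag, and the difference between match() and search(), can 
be checked on a single paragraph. This is only a sketch: the item string is 
one entry from the question typed in by hand, and the variable names are mine.

```python
import re

# One paragraph, as the original readparagraphs() would yield it.
item = (
    "Type de notice : monographie\n"
    "Auteur(s) : John Doe\n"
    "Titre(s) : Argl bargl\n"
)

pattern = r'^Auteur\(s\) : (?P<author>.+)$'

# Without MULTILINE, ^ anchors only at the very start of the string.
plain = re.compile(pattern)
# With MULTILINE, ^ and $ also anchor at the start and end of each line.
multi = re.compile(pattern, re.MULTILINE)

no_flag = plain.search(item)       # None: the string starts with "Type"
with_match = multi.match(item)     # None: match() only tries position 0
with_search = multi.search(item)   # finds the "Auteur(s)" line

author = with_search.group('author')
print("author = {%s}," % author)
```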
> 
> 4. My aim
> 
> I would like to have a dictionary with fixed keys (the BibTeX fields) and
> values extracted from my file for every paragraph and then write this, in a
> proper format, to a BibTeX file. If a paragraph does not provide a value for a
> particular key, I could then, in a second pass over the BibTeX file, delete
> these lines.

I would write the code to exclude those lines in the first place. If the 
dict returned from readparagraphs() is missing a key, then don't write 
the corresponding line.
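
A rough sketch of that idea, assuming each entry is already a dict of field 
names to values. TYPE_MAP, FIELD_MAP, make_key and to_bibtex are hypothetical 
names of mine, the 'Misc' fallback for unknown entry types is an assumption, 
and the key format follows the "doeargl" example from the question.

```python
# Hypothetical mappings from the source vocabulary to BibTeX names.
TYPE_MAP = {'monographie': 'Book', 'article': 'Article'}
FIELD_MAP = [('Auteur(s)', 'author'), ('Titre(s)', 'title')]


def make_key(record):
    """Build a key such as 'doeargl' from whichever fields are present."""
    parts = []
    if 'Auteur(s)' in record:
        parts.append(record['Auteur(s)'].split()[-1])  # last word of the name
    if 'Titre(s)' in record:
        parts.append(record['Titre(s)'].split()[0])    # first word of the title
    return ''.join(parts).lower()


def to_bibtex(record):
    """Format one entry, writing only the fields the record contains."""
    entry_type = TYPE_MAP.get(record.get('Type de notice'), 'Misc')
    lines = ['@%s{%s,' % (entry_type, make_key(record))]
    for source_name, bibtex_name in FIELD_MAP:
        if source_name in record:  # a missing key means no line is written
            lines.append('  %s = {%s},' % (bibtex_name, record[source_name]))
    lines.append('}')
    return '\n'.join(lines)


book = {'Type de notice': 'monographie',
        'Auteur(s)': 'John Doe',
        'Titre(s)': 'Argl bargl'}
anthology = {'Type de notice': 'recueil', 'Titre(s)': 'orgl gorgl'}

print(to_bibtex(book))
print(to_bibtex(anthology))
```

Because the test for a missing field happens while writing, no second pass 
over the output is needed. The sketch assumes that a field which is present 
is never empty.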

Kent