[Tutor] extracting matches by paragraph

Thomas A. Schmitz thomas.schmitz at uni-bonn.de
Wed Oct 11 10:48:32 CEST 2006


Please excuse this long mail. I have read several tutorials and googled for
three days, but haven't made any real progress on this question, probably
because I'm an absolute novice at Python. I'd be very grateful for some help.

1. My problem:

I have several files in a structured database format. They
contain entries like this:

Type de notice : monographie
Auteur(s) : John Doe
Titre(s) : Argl bargl
Publication : Denver, University of Colorado Press, 1776

Type de notice : article
Auteur(s) : Richard Doe
Titre(s) : wurgl burgl

Type de notice : recueil
Titre(s) : orgl gorgl

I want to translate this into a BibTeX format. My approach was to read the
file in by paragraphs, then extract the values of the fields that interest me
and write these values to another file. I cannot go line by line since I want
to reuse, e.g., the value of the "Auteur(s)" and "Titre(s)" fields to generate
a key for every item, in the form of "doeargl" or "doewurgl" (via the split
and join functions). The problem is that not every entry contains every field
(in my example, #3 doesn't have an author), so I guess I need to test for the
existence of these fields before I can use their values.
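
A minimal sketch of the key generation described above. The helper name
make_key is hypothetical, and the rule it applies (last word of the author
field plus first word of the title field, lower-cased) is only an assumption
read off the "doeargl" and "doewurgl" examples. A missing field is skipped
instead of raising an error.

```python
def make_key(author, title):
    """Hypothetical helper: build a key such as 'doeargl'.

    Assumes the key is the last word of the author field followed by
    the first word of the title field, lower-cased. Either argument
    may be None or empty when the entry lacks that field.
    """
    author_words = author.split() if author else []
    title_words = title.split() if title else []
    # Slicing never raises, even when one of the lists is empty.
    parts = author_words[-1:] + title_words[:1]
    return "".join(parts).lower()


print(make_key("John Doe", "Argl bargl"))      # doeargl
print(make_key("Richard Doe", "wurgl burgl"))  # doewurgl
print(make_key(None, "orgl gorgl"))            # orgl
```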

2. The approach:

There is code here
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/408996/
which allows reading a file by paragraphs. I copied this to my script:

class FileIterator(object):
    """ A general purpose file object iterator cum
    file object proxy """

    def __init__(self, fw):
        self._fw = fw

    def readparagraphs(self):
        """ Paragraph iterator """

        # This re-uses Alex Martelli's
        # paragraph reading recipe.
        # Python Cookbook 2nd edition 19.10, Page 713
        paragraph = []
        for line in self._fw:
            if line.isspace():
                if paragraph:
                    yield "".join(paragraph)
                    paragraph = []
            else:
                paragraph.append(line)
        if paragraph:
            yield "".join(paragraph)

When I now run a very basic test:

# iter is a FileIterator wrapping the open database file
for item in iter.readparagraphs():
    print(item)

The entire file is reprinted paragraph by paragraph, so this code appears to
work.

I can also match the first line of every paragraph like so:

import re

reBook = re.compile('^Type de notice : monographie')
for item in iter.readparagraphs():
    m1 = reBook.match(item)
    if m1:
        print("@Book{,")

This will print a line @Book{, for every "monographie" in the database -- a
good start, I thought!

3. The problem that's driving me insane

But as soon as I try to match anything inside of the paragraph:

reAuthor = re.compile(r'^Auteur\(s\) : (?P<author>.+)$')
for item in iter.readparagraphs():
    m2 = reAuthor.match(item)
    if m2:
        author = m2.group('author')
        print("author = {%s}," % author)

I get no matches at all. I have tried to remove the ^ and the $ from the
regex, or to add the "re.DOTALL" flag, but to no avail.
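
The behaviour can be reproduced in isolation. The sketch below assumes that
each paragraph arrives as one string with embedded newlines, as the paragraph
reader above produces. In Python's re module, match() only tries the pattern
at the very start of the string, and without re.MULTILINE the ^ anchor also
matches only there, so a field on the second line is never found. Combining
search() with re.MULTILINE is one possible way around this, not necessarily
the only one.

```python
import re

# Sample paragraph copied from the first entry shown earlier.
paragraph = (
    "Type de notice : monographie\n"
    "Auteur(s) : John Doe\n"
    "Titre(s) : Argl bargl\n"
    "Publication : Denver, University of Colorado Press, 1776\n"
)

pattern = r'^Auteur\(s\) : (?P<author>.+)$'

# Without flags, ^ anchors only at the start of the whole string.
no_flag = re.compile(pattern)
m_match = no_flag.match(paragraph)

# With re.MULTILINE, ^ and $ also anchor at every line boundary,
# and search() scans the whole string instead of only position 0.
multiline = re.compile(pattern, re.MULTILINE)
m_search = multiline.search(paragraph)
author = m_search.group('author') if m_search else None

print(m_match)  # None
print(author)   # John Doe
```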

4. My aim

I would like to have a dictionary with fixed keys (the BibTeX fields) and
values extracted from my file for every paragraph and then write this, in a
proper format, to a BibTeX file. If a paragraph does not provide a value for a
particular key, I could then, in a second pass over the BibTeX file, delete
these lines. But that means I first have to match and extract the values from
my paragraphs.
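
A minimal sketch of such a per-paragraph dictionary, assuming Python 3 and
assuming every field sits on its own line in the form "Name : value". The
names FIELD_RE, FIELD_MAP, parse_paragraph and bibtex_lines are hypothetical,
and the mapping to BibTeX field names is only an illustration. Because missing
fields are skipped when the lines are formatted, this variant would not need
the second pass.

```python
import re

# One "Name : value" pair per line; re.MULTILINE lets ^ and $ anchor
# at every line inside the paragraph string.
FIELD_RE = re.compile(r'^(?P<name>[^:\n]+?) : (?P<value>.+)$', re.MULTILINE)

# Hypothetical mapping from database field names to BibTeX field names.
FIELD_MAP = [("Auteur(s)", "author"), ("Titre(s)", "title")]


def parse_paragraph(paragraph):
    """Return a dict of every field found in one paragraph."""
    return {m.group('name').strip(): m.group('value').strip()
            for m in FIELD_RE.finditer(paragraph)}


def bibtex_lines(entry):
    """Format only the fields that are present in the entry."""
    lines = []
    for source_name, bibtex_name in FIELD_MAP:
        value = entry.get(source_name)  # None when the field is missing
        if value is not None:
            lines.append("%s = {%s}," % (bibtex_name, value))
    return lines


# Third sample entry from above, which has no author.
paragraph = "Type de notice : recueil\nTitre(s) : orgl gorgl\n"
entry = parse_paragraph(paragraph)
print(entry)
print(bibtex_lines(entry))  # ['title = {orgl gorgl},']
```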

What am I doing wrong? Or is the entire approach flawed? What alternative
method would you suggest?

Thanks for any help on this

Thomas

