[Tutor] extracting matches by paragraph
Thomas A. Schmitz
thomas.schmitz at uni-bonn.de
Wed Oct 11 10:48:32 CEST 2006
Please excuse this long mail. I have read several tutorials and googled
for three days, but haven't made any real progress on this question,
probably because I'm an absolute novice at Python. I'd be very grateful
for some help.
1. My problem:
I have several files in a structured database format. They
contain entries like this:
Type de notice : monographie
Auteur(s) : John Doe
Titre(s) : Argl bargl
Publication : Denver, University of Colorado Press, 1776

Type de notice : article
Auteur(s) : Richard Doe
Titre(s) : wurgl burgl

Type de notice : recueil
Titre(s) : orgl gorgl
I want to translate this into BibTeX format. My approach was to read the
file in by paragraphs, then extract the values of the fields that
interest me and write these values to another file. I cannot go line by
line since I want to reuse, e.g., the values of the "Auteur(s)" and
"Titre(s)" fields to generate a key for every item, in the form of
"doeargl" or "doewurgl" (via the split and join functions). The problem
is that not every entry contains every field (in my example, #3 doesn't
have an author), so I guess I need to test for the existence of these
fields before I can use their values.
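A minimal sketch of the key generation described above, assuming the fields of one entry have already been collected into a dict. The function name make_key and the dict layout are hypothetical, chosen only for illustration. dict.get with a default covers the case of a missing field, so no separate existence test is needed.

```python
def make_key(fields):
    """Build a citation key such as 'doeargl' from a dict of parsed fields."""
    # dict.get returns the default when a field is absent, so no KeyError.
    author_words = fields.get("Auteur(s)", "").split()
    title_words = fields.get("Titre(s)", "").split()
    # Last word of the author name, first word of the title; either may be empty.
    surname = author_words[-1] if author_words else ""
    first_word = title_words[0] if title_words else ""
    return (surname + first_word).lower()


print(make_key({"Auteur(s)": "John Doe", "Titre(s)": "Argl bargl"}))  # doeargl
print(make_key({"Titre(s)": "orgl gorgl"}))  # orgl
```

An entry without an author simply gets a key built from its title alone.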
2. The approach:
There is code here
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/408996/
which reads a file by paragraphs. I copied this into my script:
class FileIterator(object):
    """ A general purpose file object iterator cum
    file object proxy """

    def __init__(self, fw):
        self._fw = fw

    def readparagraphs(self):
        """ Paragraph iterator """
        # This re-uses Alex Martelli's
        # paragraph reading recipe.
        # Python Cookbook 2nd edition 19.10, Page 713
        paragraph = []
        for line in self._fw:
            if line.isspace():
                if paragraph:
                    yield "".join(paragraph)
                    paragraph = []
            else:
                paragraph.append(line)
        if paragraph:
            yield "".join(paragraph)
When I now run a very basic test:
for item in iter.readparagraphs():
    print(item)
The entire file is reprinted paragraph by paragraph, so this code
appears to work.
I can also match the first line of every paragraph like so:
import re

reBook = re.compile('^Type de notice : monographie')
for item in iter.readparagraphs():
    m1 = reBook.match(item)
    if m1:
        print("@Book{,")
This prints the line @Book{, for every "monographie" in the database: a
good start, I thought!
3. The problem that's driving me insane
But as soon as I try to match anything inside of the paragraph:
reAuthor = re.compile(r'^Auteur\(s\) : (?P<author>.+)$')
m2 = reAuthor.match(item)
if m2:
    author = m2.group('author')
    print("author = {%s}," % author)
I get no matches at all. I have tried removing the ^ and the $ from the
regex, and adding the re.DOTALL flag, but to no avail.
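One likely explanation, sketched under the assumption that each paragraph arrives as a single multi-line string (as readparagraphs() yields it): a pattern's match() method only tries the very beginning of the string, so it can never reach the second line, with or without ^ in the pattern. re.DOTALL only changes what "." matches. Using search() together with re.MULTILINE, which lets ^ and $ match at every line boundary, finds the field. The sample paragraph is taken from the example data above.

```python
import re

paragraph = (
    "Type de notice : monographie\n"
    "Auteur(s) : John Doe\n"
    "Titre(s) : Argl bargl\n"
)

# re.MULTILINE lets ^ and $ match at the start and end of every line.
re_author = re.compile(r"^Auteur\(s\) : (?P<author>.+)$", re.MULTILINE)

# match() only tries position 0, where the paragraph starts with "Type de notice".
print(re_author.match(paragraph))  # None

# search() scans the whole string and finds the second line.
found = re_author.search(paragraph)
if found:
    print("author = {%s}," % found.group("author"))  # author = {John Doe},
```

This also fits the observation that the "Type de notice" pattern works: that field happens to sit at the start of every paragraph.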
4. My aim
I would like to have a dictionary with fixed keys (the BibTeX fields)
and values extracted from my file for every paragraph, and then write
this, in a proper format, to a BibTeX file. If a paragraph does not
provide a value for a particular key, I could then, in a second pass
over the BibTeX file, delete these lines. But that means I first have to
match and extract the values from my paragraphs.
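A rough sketch of this aim, assuming every field line has the form "Name : value". The names FIELD_MAP, TYPE_MAP, parse_paragraph and to_bibtex are hypothetical, and the fallback entry type "Misc" is an assumption; only the monographie-to-Book mapping comes from the code above. Writing only the fields that are present may make the second pass unnecessary.

```python
# Assumed mapping from source field names to BibTeX field names.
FIELD_MAP = {"Auteur(s)": "author", "Titre(s)": "title"}
# Only "monographie" -> "Book" is given; other types fall back to "Misc" (assumption).
TYPE_MAP = {"monographie": "Book"}


def parse_paragraph(paragraph):
    """Turn 'Name : value' lines into a dict; lines without ' : ' are skipped."""
    fields = {}
    for line in paragraph.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def to_bibtex(fields, key):
    """Format one entry, emitting only the fields that are present."""
    entry_type = TYPE_MAP.get(fields.get("Type de notice", ""), "Misc")
    lines = ["@%s{%s," % (entry_type, key)]
    for source_name, bibtex_name in FIELD_MAP.items():
        if source_name in fields:
            lines.append("  %s = {%s}," % (bibtex_name, fields[source_name]))
    lines.append("}")
    return "\n".join(lines)


entry = parse_paragraph("Type de notice : recueil\nTitre(s) : orgl gorgl\n")
print(to_bibtex(entry, "orgl"))
```

For the third example entry this prints an entry with a title line and no author line, since the author field is simply absent from the dict.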
What am I doing wrong? Or is the entire approach flawed? What
alternative method would you suggest?
Thanks for any help on this
Thomas