[Tutor] extracting matches by paragraph
Kent Johnson
kent37 at tds.net
Wed Oct 11 12:06:46 CEST 2006
Thomas A. Schmitz wrote:
> Please excuse this long mail. I have read several tutorials and googled for
> three days, but haven't made any real progress on this question, probably
> because I'm an absolute novice at python. I'd be very grateful for some help.
>
> 1. My problem:
>
> I have several files in a structured database format. They
> contain entries like this:
>
> Type de notice : monographie
> Auteur(s) : John Doe
> Titre(s) : Argl bargl
> Publication : Denver, University of Colorado Press, 1776
>
> Type de notice : article
> Auteur(s) : Richard Doe
> Titre(s) : wurgl burgl
>
> Type de notice : recueil
> Titre(s) : orgl gorgl
>
> I want to translate this into a BibTeX format. My approach was to read the
> file in by paragraphs, then extract the values of the fields that interest me
> and write these values to another file. I cannot go line by line since I want
> to reuse, e.g., the value of the "Auteur(s)" and "Titre(s)" fields to generate
> a key for every item, in the form of "doeargl" or "doewurgl" (via the split
> and join functions). The problem is that not every entry contains every field
> (in my example, #3 doesn't have an author), so I guess I need to test for the
> existence of these fields before I can use their values.
>
> 2. The approach:
>
> There is code here
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/408996/
> which allows to read a file by paragraphs. I copied this to my script:
>
> class FileIterator(object):
>     """ A general purpose file object iterator cum
>     file object proxy """
>
>     def __init__(self, fw):
>         self._fw = fw
>
>     def readparagraphs(self):
>         """ Paragraph iterator """
>
>         # This re-uses Alex Martelli's
>         # paragraph reading recipe.
>         # Python Cookbook 2nd edition 19.10, Page 713
>         paragraph = []
>         for line in self._fw:
>             if line.isspace():
>                 if paragraph:
>                     yield "".join(paragraph)
>                     paragraph = []
>             else:
>                 paragraph.append(line)
>         if paragraph:
>             yield "".join(paragraph)
>
> When I now run a very basic test:
>
> for item in iter.readparagraphs():
>     print item
>
> The entire file is reprinted paragraph by paragraph, so this code
> appears to work.
I would take out the join in this, at least, and return a list of lines.
You don't really have a paragraph, you have structured data. There is
no need to throw away the structure.
It might be even more useful to return a dictionary that maps field
names to values. Also there doesn't seem to be any reason to make
FileIterator a class, you can use just a generator function (Dick Moores
take notice!):
def readparagraphs(fw):
    data = {}
    for line in fw:
        if line.isspace():
            if data:
                yield data
                data = {}
        else:
            # Split on the first ' : ' only, and drop the trailing newline.
            key, value = line.split(' : ', 1)
            data[key] = value.strip()
    if data:
        yield data
Now you don't need a regexp, you have usable data directly from the
iterator.
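As a rough sketch of how the dicts could then be used to build the "doeargl" style keys from the question: the generator is repeated here so the example runs on its own, and make_key() is a hypothetical helper of mine, not something from the original script. It assumes the key is the last word of the author plus the first word of the title, and it simply leaves out whichever part is missing.

```python
import io


def readparagraphs(fw):
    """Yield one dict per blank-line-separated record."""
    data = {}
    for line in fw:
        if line.isspace():
            if data:
                yield data
                data = {}
        else:
            key, value = line.split(' : ', 1)
            data[key] = value.strip()
    if data:
        yield data


def make_key(record):
    """Hypothetical helper: author's last word + title's first word, lowercased."""
    parts = []
    author = record.get('Auteur(s)')
    title = record.get('Titre(s)')
    if author:
        parts.append(author.split()[-1])
    if title:
        parts.append(title.split()[0])
    return ''.join(parts).lower()


# The three sample entries from the question, as an in-memory file.
SAMPLE = u"""Type de notice : monographie
Auteur(s) : John Doe
Titre(s) : Argl bargl
Publication : Denver, University of Colorado Press, 1776

Type de notice : article
Auteur(s) : Richard Doe
Titre(s) : wurgl burgl

Type de notice : recueil
Titre(s) : orgl gorgl
"""

records = list(readparagraphs(io.StringIO(SAMPLE)))
keys = [make_key(record) for record in records]
print(keys)
```

Using record.get() rather than record[...] is what lets the third entry, which has no author, pass through without a KeyError.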
> I can also match the first line of every paragraph like so:
>
> reBook = re.compile('^Type de notice : monographie')
> for item in iter.readparagraphs():
>     m1 = reBook.match(item)
>     if m1:
>         print "@Book{,"
>
> this will print a line @Book{, for every "monographie" in the database -- a
> good start, I thought!
>
> 3. The problem that's driving me insane
>
> But as soon as I try to match anything inside of the paragraph:
>
> reAuthor = re.compile('^Auteur\(s\) : (?P<author>.+)$')
> m2 = reAuthor.match(item)
> if m2:
>     author = m2.group('author')
>     print "author = {%s}," % author
>
> I get no matches at all. I have tried to remove the ^ and the $ from the
> regex, or to add the "re.DOTALL" flag, but to no avail.
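You need re.MULTILINE to make ^ and $ match at the start and end of each
line; re.DOTALL only affects whether . matches newlines. Note that match()
still only tries the very start of the string even with re.MULTILINE, so
use search() to find a line inside the paragraph.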
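Here is a small sketch of the difference between the flags, using a made-up paragraph string in place of a real item from the file. The variable names are only for illustration.

```python
import re

# A stand-in for one paragraph as produced by the joined-lines iterator.
paragraph = (
    "Type de notice : monographie\n"
    "Auteur(s) : John Doe\n"
    "Titre(s) : Argl bargl\n"
)

pattern = r'^Auteur\(s\) : (?P<author>.+)$'

# No flags: ^ matches only at the very start of the whole string.
plain = re.compile(pattern).search(paragraph)

# DOTALL alone: . may now cross newlines, but ^ is still start-of-string only.
dotall = re.compile(pattern, re.DOTALL).search(paragraph)

# MULTILINE with match(): match() only ever tries position 0 of the string.
multiline_match = re.compile(pattern, re.MULTILINE).match(paragraph)

# MULTILINE with search(): ^ and $ work per line and every position is tried.
multiline_search = re.compile(pattern, re.MULTILINE).search(paragraph)

author = multiline_search.group('author') if multiline_search else None
print(author)
```

Only the last combination finds the author line; the first three all come back as None.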
>
> 4. My aim
>
> I would like to have a dictionary with fixed keys (the BibTeX field) and
> values extracted from my file for every paragraph and then write this, in a
> proper format, to a bibtex file. If a paragraph does not provide a value for
> a particular key, I could then, in a second pass over the bibtex file, delete
> these lines.
I would write the code to exclude those lines in the first place. If the
dict returned from readparagraphs() is missing a key, then don't write
the corresponding line.
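Something along these lines might do it. This is only a sketch: FIELD_MAP, TYPE_MAP and format_entry() are names I made up, the only mapping taken from your mail is monographie to @Book, and the other entry types (including the 'Misc' fallback) are guesses you would want to adjust.

```python
# Assumed mapping from the source field names to BibTeX field names.
FIELD_MAP = [
    ('Auteur(s)', 'author'),
    ('Titre(s)', 'title'),
]

# Assumed mapping from "Type de notice" values to BibTeX entry types.
# Only monographie -> Book comes from the original mail.
TYPE_MAP = {
    'monographie': 'Book',
    'article': 'Article',
    'recueil': 'Collection',
}


def format_entry(record, key):
    """Build one BibTeX entry, writing only the fields the record has."""
    entry_type = TYPE_MAP.get(record.get('Type de notice'), 'Misc')
    lines = ['@%s{%s,' % (entry_type, key)]
    for source_name, bibtex_name in FIELD_MAP:
        if source_name in record:  # skip fields missing from this record
            lines.append('  %s = {%s},' % (bibtex_name, record[source_name]))
    lines.append('}')
    return '\n'.join(lines)


# The third sample entry has no author, so no author line is written.
record = {'Type de notice': 'recueil', 'Titre(s)': 'orgl gorgl'}
entry = format_entry(record, 'orgl')
print(entry)
```

Because the missing fields are never written, there is nothing to clean up in a second pass.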
Kent