[Tutor] extracting matches by paragraph

Wed Oct 11 14:48:56 CEST 2006

Thomas A. Schmitz wrote:
> On Oct 11, 2006, at 12:06 PM, Kent Johnson wrote:
> 
>> I would take out the join in this, at least, and return a list of  
>> lines. You don't really have a paragraph, you have structured data.  
>> There is no need to throw away the structure.
>>
>> It might be even more useful to return a dictionary that maps field  
>> names to values. Also there doesn't seem to be any reason to make  
>> FileIterator a class, you can use just a generator function (Dick  
>> Moores take notice!):
>>
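>> def readparagraphs(fw):
>>     # Yield one dict of {field name: value} per blank-line-separated paragraph
>>     data = {}
>>     for line in fw:
>>         if line.isspace():
>>             if data:
>>                 yield data
>>                 data = {}
>>         else:
>>             # Split on the first ' : ' only; strip the trailing newline
>>             key, value = line.split(' : ', 1)
>>             data[key.strip()] = value.strip()
>>     if data:
>>         yield data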
>>
>> Now you don't need a regexp, you have usable data directly from the  
>> iterator.
>>
> 
> Thank you for your help, Kent! But I'm not sure if this is  
> practicable. As I said, a line-by-line approach does not work,

What I have outlined is not a line-by-line approach. It still returns 
the data one paragraph at a time, but in a more usable format than your 
original iterator.

Try printing out the values you get from the iterator, the same way you 
did with your original paragraph iterator.
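
For instance, here is a minimal sketch of what that printout might look 
like. The sample text and the 'article' value are made up for 
illustration, and io.StringIO stands in for your real file; the 
generator is repeated only so the snippet runs on its own.

```python
import io

def readparagraphs(fw):
    """Yield one dict per blank-line-separated paragraph of 'key : value' lines."""
    data = {}
    for line in fw:
        if line.isspace():
            if data:
                yield data
                data = {}
        else:
            # Split on the first ' : ' only and drop the trailing newline
            key, value = line.split(' : ', 1)
            data[key.strip()] = value.strip()
    if data:
        yield data

# Hypothetical sample input; the fields in the real file may differ.
SAMPLE = (
    "Type de notice : monographie\n"
    "Publication : Denver, University of Colorado Press, 1776\n"
    "\n"
    "Type de notice : article\n"
)

paragraphs = list(readparagraphs(io.StringIO(SAMPLE)))
for para in paragraphs:
    print(para)  # one dict per paragraph, not one per line
```

Each printed dict holds all the fields of one paragraph, so nothing is 
lost compared to the joined string.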

> for  
> two reasons:
> 1. I want to combine and translate the results from two lines;

You can do that with this approach.
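
As a rough sketch of combining two fields: since each paragraph is a 
dict, both values are available at once. The ENTRY_TYPES mapping, the 
'Misc' fallback, the entry_header name and the use of the year as a 
citation key are all assumptions for illustration, not something from 
your data.

```python
# Hypothetical translation table from 'Type de notice' values to BibTeX types.
ENTRY_TYPES = {'monographie': 'Book'}

def entry_header(para):
    """Combine two fields of one paragraph dict into a BibTeX-style header."""
    entry_type = ENTRY_TYPES.get(para.get('Type de notice'), 'Misc')
    # Assumes the year is the last comma-separated part of 'Publication'.
    year = para.get('Publication', '').split(', ')[-1]
    return "@%s{%s," % (entry_type, year)

para = {
    'Type de notice': 'monographie',
    'Publication': 'Denver, University of Colorado Press, 1776',
}
header = entry_header(para)
print(header)
```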

> 2. in the file, there are lines of the form
> Publication : Denver, University of Colorado Press, 1776
> from which I need to extract three values (address, publisher, date),  
> and I may need to discard some other stuff from other lines. So I do  
> need a regex, I think. Unfortunately, the structure is not strong  
> enough to make a one on one translation viable, so I do need to  
> extract the values...

Ok, so the dict from the iterator is still raw data that needs some 
processing. Something like

for para in readparagraphs(open('mydata.txt')):
    # Your previous example
    if para.get('Type de notice') == 'monographie':
        print("@Book{,")

    # publication data
    pubData = para.get('Publication')
    if pubData:
        # assumes exactly three comma-separated parts, as in your example line
        address, publisher, date = pubData.split(', ')
        # do something with address, etc.
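
If the plain split turns out to be too fragile (for example, a publisher 
name that itself contains a comma), a regex applied to just that one 
value may help. This is only a sketch: the pattern assumes the layout 
"address, publisher, four-digit year" seen in your example line, and 
PUBLICATION_RE and parse_publication are names I made up.

```python
import re

# Assumed layout: "<address>, <publisher>, <4-digit year>"
PUBLICATION_RE = re.compile(
    r'^(?P<address>[^,]+),\s*(?P<publisher>.+),\s*(?P<date>\d{4})\s*$'
)

def parse_publication(pub_data):
    """Return (address, publisher, date), or None if the value does not fit."""
    match = PUBLICATION_RE.match(pub_data)
    if match is None:
        return None
    return (match.group('address').strip(),
            match.group('publisher').strip(),
            match.group('date'))

result = parse_publication('Denver, University of Colorado Press, 1776')
print(result)
```

Returning None for values that do not fit lets you log or skip odd 
entries instead of stopping on a ValueError.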

Kent

PS Please reply on-list.