[Tutor] extracting matches by paragraph
Kent Johnson
kent37 at tds.net
Wed Oct 11 14:48:56 CEST 2006
Thomas A. Schmitz wrote:
> On Oct 11, 2006, at 12:06 PM, Kent Johnson wrote:
>
>> I would take out the join in this, at least, and return a list of
>> lines. You don't really have a paragraph, you have structured data.
>> There is no need to throw away the structure.
>>
>> It might be even more useful to return a dictionary that maps field
>> names to values. Also there doesn't seem to be any reason to make
>> FileIterator a class, you can use just a generator function (Dick
>> Moores take notice!):
>>
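>> def readparagraphs(fw):
>>     data = {}
>>     for line in fw:
>>         if line.isspace():
>>             # A blank line ends the current paragraph
>>             if data:
>>                 yield data
>>                 data = {}
>>         else:
>>             # Split on the first ' : ' only and drop the trailing newline
>>             key, value = line.rstrip('\n').split(' : ', 1)
>>             data[key] = value
>>     # Yield the last paragraph if the file does not end with a blank line
>>     if data:
>>         yield data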
>>
>> Now you don't need a regexp, you have usable data directly from the
>> iterator.
>>
>
> Thank you for your help, Kent! But I'm not sure if this is
> practicable. As I said, a line-by-line approach does not work,
What I have outlined is not a line-by-line approach. It still returns
data one paragraph at a time, but in a more usable format than your
original iterator.
Try printing out the values you get from the iterator, the same way you
did with your original paragraph iterator.
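
For instance, here is a minimal sketch of that check. It uses a lightly tidied copy of the generator (no `self`, trailing newline stripped, split on the first ' : ' only) and made-up sample text in place of your file. Only the 'Type de notice' and 'Publication' field names come from this thread; 'Titre' and the sample values are invented for illustration.

```python
import io

def readparagraphs(fw):
    """Yield one dict per blank-line-separated paragraph of 'key : value' lines."""
    data = {}
    for line in fw:
        if line.isspace():
            # A blank line closes the current paragraph
            if data:
                yield data
                data = {}
        else:
            key, value = line.rstrip('\n').split(' : ', 1)
            data[key] = value
    # Last paragraph, if the input does not end with a blank line
    if data:
        yield data

# Made-up sample standing in for the real file
sample = io.StringIO(
    "Type de notice : monographie\n"
    "Publication : Denver, University of Colorado Press, 1776\n"
    "\n"
    "Type de notice : article\n"
    "Titre : An invented title\n"
)

paragraphs = list(readparagraphs(sample))
for para in paragraphs:
    print(para)
```

Each printed value is one whole paragraph, already broken into named fields.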
> for two reasons:
> 1. I want to combine and translate the results from two lines;
You can do that with this approach.
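
As a rough sketch of what I mean: once a paragraph is a dict, combining two of its lines is just a matter of looking up two keys. I don't know which fields you want to combine, so 'Auteur' and 'Titre' below are hypothetical names, and the output format is only a placeholder.

```python
def author_title(para):
    """Combine two fields of one paragraph dict into a single string.

    'Auteur' and 'Titre' are hypothetical field names; substitute the
    ones that actually occur in the data.
    """
    author = para.get('Auteur')
    title = para.get('Titre')
    if author is None or title is None:
        # One of the two lines is missing from this paragraph
        return None
    return '%s: %s' % (author, title)

# Invented example paragraph
para = {'Auteur': 'Doe, Jane', 'Titre': 'An invented title'}
combined = author_title(para)
print(combined)
```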
> 2. in the file, there are lines of the form
> Publication : Denver, University of Colorado Press, 1776
> from which I need to extract three values (address, publisher, date),
> and I may need to discard some other stuff from other lines. So I do
> need a regex, I think. Unfortunately, the structure is not strong
> enough to make a one on one translation viable, so I do need to
> extract the values...
OK, so the dict from the iterator is still raw data that needs some
processing. Something like

for para in readparagraphs(open('mydata.txt')):
    # Your previous example
    if para.get('Type de notice') == 'monographie':
        print("@Book{,")

    # Publication data
    pubData = para.get('Publication')
    if pubData:
        # Assumes exactly three parts separated by ', '
        address, publisher, date = pubData.split(', ')
        # do something with address, etc.
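
If the Publication lines are not always three clean comma-separated parts, a regex can be applied to that one field rather than to the whole paragraph. This is only a sketch under an assumed layout of "address, publisher, four-digit year"; the pattern and the `parse_publication` name are mine, not anything taken from your data, so adjust them to what the file really contains.

```python
import re

# Assumed layout: "<address>, <publisher>, <4-digit year>"
PUB_RE = re.compile(
    r'^(?P<address>[^,]+),\s*(?P<publisher>.+),\s*(?P<date>\d{4})\s*$'
)

def parse_publication(pub_data):
    """Return (address, publisher, date), or None if the line does not fit."""
    match = PUB_RE.match(pub_data)
    if match is None:
        return None
    return match.group('address'), match.group('publisher'), match.group('date')

result = parse_publication('Denver, University of Colorado Press, 1776')
print(result)
```

Returning None for lines that do not fit lets the caller skip or log them instead of failing on a tuple unpacking error.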
Kent
PS Please reply on-list.