[Tutor] extracting matches by paragraph

Wed Oct 11 15:18:19 CEST 2006

On Oct 11, 2006, at 2:48 PM, Kent Johnson wrote:

> Thomas A. Schmitz wrote:
>> On Oct 11, 2006, at 12:06 PM, Kent Johnson wrote:
>>
>>> I would take out the join in this, at least, and return a list of
>>> lines. You don't really have a paragraph, you have structured data.
>>> There is no need to throw away the structure.
>>>
>>> It might be even more useful to return a dictionary that maps field
>>> names to values. Also there doesn't seem to be any reason to make
>>> FileIterator a class, you can use just a generator function (Dick
>>> Moores take notice!):
>>>
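>>> def readparagraphs(fw):
>>>     # fw is any iterable of lines, e.g. an open file
>>>     data = {}
>>>     for line in fw:
>>>         if line.isspace():
>>>             # a blank line ends the current paragraph
>>>             if data:
>>>                 yield data
>>>                 data = {}
>>>         else:
>>>             # split on the first ' : ' only; drop the trailing newline
>>>             key, value = line.split(' : ', 1)
>>>             data[key] = value.strip()
>>>     if data:
>>>         yield data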
>>>
>>> Now you don't need a regexp, you have usable data directly from the
>>> iterator.
>>>
>>
>> Thank you for your help, Kent! But I'm not sure if this is
>> practicable. As I said, a line-by-line approach does not work,
>
> What I have outlined is not a line-by-line approach, it is still
> returning data for a paragraph at a time, but in a more usable format
> than your original iterator.
>
> Try printing out the values you get from the iterator, the same way you
> did with your original paragraph iterator.
>
>> for
>> two reasons:
>> 1. I want to combine and translate the results from two lines;
>
> You can do that with this approach.
>
>> 2. in the file, there are lines of the form
>> Publication : Denver, University of Colorado Press, 1776
>> from which I need to extract three values (address, publisher, date),
>> and I may need to discard some other stuff from other lines. So I do
>> need a regex, I think. Unfortunately, the structure is not strong
>> enough to make a one on one translation viable, so I do need to
>> extract the values...
>
> Ok, so the dict from the iterator is still raw data that needs some
> processing. Something like
>
> for para in readparagraphs(open('mydata.txt')):
>     # Your previous example
>     if para.get('Type de notice') == 'monographie':
>         print("@Book{,")
>
>     # publication data
>     pubData = para.get('Publication')
>     if pubData:
>         # assumes the field contains exactly two ', ' separators
>         address, publisher, date = pubData.split(', ')
>         # do something with address, etc.
>
> Kent
>
> PS Please reply on-list.
>

Kent,

thanks again, I begin to see some light and will try to make sense of  
it. I had developed the bad habit of just treating such data as  
strings, but of course it makes much more sense to keep the original  
format and then convert it to another format.
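
A minimal sketch of that idea, assuming records are separated by blank
lines and each field has the form "name : value". The sample text, the
PUB_RE pattern and the parse_publication helper are made up for
illustration; a real file will probably need a more forgiving pattern.

```python
import re

def readparagraphs(lines):
    """Yield one dict per blank-line-separated record."""
    data = {}
    for line in lines:
        if not line.strip():
            # a blank line closes the current record
            if data:
                yield data
                data = {}
        elif ' : ' in line:
            # split on the first ' : ' only, so values may contain it
            key, value = line.split(' : ', 1)
            data[key.strip()] = value.strip()
    if data:
        yield data

# Hypothetical pattern for "address, publisher, year"; the publisher
# part is allowed to contain commas of its own.
PUB_RE = re.compile(
    r'^(?P<address>[^,]+),\s*(?P<publisher>.+),\s*(?P<date>\d{4})$')

def parse_publication(text):
    """Return a dict of address/publisher/date, or None if no match."""
    m = PUB_RE.match(text)
    return m.groupdict() if m else None

# Invented sample data, two records
SAMPLE = """\
Type de notice : monographie
Publication : Denver, University of Colorado Press, 1776

Type de notice : article
"""

records = list(readparagraphs(SAMPLE.splitlines(True)))
for rec in records:
    pub = parse_publication(rec.get('Publication', ''))
    print('%s %s' % (rec.get('Type de notice'), pub))
```

The iterator keeps the raw fields, and the regex is applied only to the
one field that needs taking apart, so a record without a Publication
line simply gives None instead of breaking the loop.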

Sorry for mindlessly hitting the reply button, didn't mean to intrude  
on you.

Thanks a lot

Thomas