[Tutor] Tokenizing Help

William Witteman yam at nerd.cx
Wed Apr 22 22:18:30 CEST 2009


On Wed, Apr 22, 2009 at 09:23:30PM +0200, spir wrote:

>> I need to be able to decompose a formatted text file into identifiable,
>> possibly named pieces.  To tokenize it, in other words.  There seem to
>> be a vast array of modules to do this with (simpleparse, pyparsing etc)
>> but I cannot understand their documentation.
>
>I would recommend pyparsing, but that is just an opinion.

It looked like a good package to me as well, but I cannot see how to
define the grammar - it may be that the notation just doesn't make sense
to me.
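
For concreteness, here is the sort of thing I have been trying to
write, assuming the file uses MEDLINE-style "XX  - value" lines (the
format, field names and sample data here are my guesses, not taken from
my real file):

    from pyparsing import Word, Group, OneOrMore, Suppress, restOfLine, srange

    # a field tag is exactly two capital letters, e.g. AU or AB
    tag = Word(srange("[A-Z]"), exact=2)
    # one field: the tag, a dash, then whatever remains on the line
    field = Group(tag + Suppress("-") + restOfLine)
    record = OneOrMore(field)

    sample = "AU  - Witteman, William\nTI  - A made-up title"
    for name, value in record.parseString(sample):
        print("%s: %s" % (name, value.strip()))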

>Regular expressions may be enough, depending on your actual needs.

Perhaps, but I am cautious, because every text and most websites
discourage regexes for parsing.
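
Still, for a line-oriented format like this one, a single expression
per line might be enough.  A sketch, where the "XX  - value" layout is
my assumption about the data:

    import re

    # assumed line format: two capital letters, a dash, then the value
    m = re.match(r"([A-Z]{2})\s*-\s*(.*)", "AU  - Witteman, William")
    if m:
        print("%s: %s" % (m.group(1), m.group(2)))  # AU: Witteman, William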

>The question is: what do you need from the data? What do you expect as
>a result? The best is to provide an example of the result matching
>sample data, e.g. I want as a result a dictionary looking like
>{
>'AU': 'some text\nperhaps across lines',
>'AB': ['some other text', 'there may be multiples of some fields'],
>'UN': 'any 2-letter combination may exist...',
>...
>}

I think that a dictionary could work, but it would have to use lists as
the values, to prevent key collisions.  That said, returning a list of
dictionaries (one dictionary per bibliographic reference) would work very
well in the larger context of my program.
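
To make that concrete, something like the sketch below would build the
structure I have in mind.  The "XX  - value" line format, the blank
lines between references and the unmarked continuation lines are all
assumptions on my part:

    import re

    FIELD = re.compile(r"^([A-Z]{2})\s*-\s*(.*)$")

    def parse(text):
        """Return one dict per reference; values are lists to allow repeats."""
        records, current, last = [], {}, None
        for line in text.splitlines():
            if not line.strip():               # blank line: end of a reference
                if current:
                    records.append(current)
                current, last = {}, None
                continue
            m = FIELD.match(line)
            if m:                              # a new field starts here
                tag, value = m.groups()
                current.setdefault(tag, []).append(value)
                last = tag
            elif last:                         # continuation of the last field
                current[last][-1] += "\n" + line.strip()
        if current:
            records.append(current)
        return records

    # hypothetical usage; "library.txt" is a made-up file name
    refs = parse(open("library.txt").read())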

>The choice of an appropriate tool, and any hints on possible
>algorithms, depend on this.

I hope this helps clarify what I am after.  I spent quite some time
with pyparsing, but I was never able to express the rules of my grammar
from the examples on the website.
-- 

yours,

William


