[Tutor] Tokenizing Help
William Witteman
yam at nerd.cx
Wed Apr 22 22:18:30 CEST 2009
On Wed, Apr 22, 2009 at 09:23:30PM +0200, spir wrote:
>> I need to be able to decompose a formatted text file into identifiable,
>> possibly named pieces. To tokenize it, in other words. There seem to
>> be a vast array of modules to do this with (simpleparse, pyparsing etc)
>> but I cannot understand their documentation.
>
>I would recommend pyparsing, but this is an opinion.
It looked like a good package to me as well, but I cannot see how to
define the grammar - it may be that the notation just doesn't make sense
to me.
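For concreteness, here is about as far as I got - a minimal sketch
that assumes each field is a two-letter code, a dash, and the rest of
the line (a simplified guess at my format; it does not handle values
that continue across lines):

    import string
    from pyparsing import Word, Suppress, Group, OneOrMore, restOfLine

    tag    = Word(string.ascii_uppercase, exact=2)  # two-letter field code
    value  = Suppress('-') + restOfLine             # text after the dash
    field  = Group(tag + value)                     # -> ['AU', ' some text']
    record = OneOrMore(field)

    sample = "AU - Witteman, William\nAB - First abstract.\nAB - Another."
    for code, text in record.parseString(sample):
        print(code + ': ' + text.strip())

Even getting this far took a lot of trial and error.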
>Regular expressions may be enough, depending on your actual needs.
Perhaps, but I am cautious, because every textbook and most websites
discourage using regexes for parsing.
>The question is: what do you need from the data? What do you expect as a result? It is best to provide an example of the result you want for some sample data, e.g. "I want the result to be a dictionary looking like:"
>{
>'AU': 'some text\nperhaps across lines',
>'AB': ['some other text', 'there may be multiples of some fields'],
>'UN': 'any 2-letter combination may exist...',
>...
>}
I think that a dictionary could work, but it would have to use lists as
the values, to prevent key collisions. That said, returning a list of
dictionaries (one dictionary per bibliographic reference) would work very
well in the larger context of my program.
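To make that concrete, here is a rough sketch of the regex route. The
"XX - value" line layout, the continuation lines, and the blank-line
record separators are all assumptions about my format:

    import re

    FIELD = re.compile(r'^([A-Z]{2})\s*-\s*(.*)$')

    def parse_record(chunk):
        """One dict per reference; list values allow repeated fields."""
        record = {}
        last = None
        for line in chunk.splitlines():
            m = FIELD.match(line)
            if m:
                code, text = m.groups()
                record.setdefault(code, []).append(text)
                last = code
            elif last is not None and line.strip():
                # assume an unmatched non-blank line continues the last field
                record[last][-1] += '\n' + line.strip()
        return record

    def parse_file(text):
        # records assumed to be separated by blank lines
        return [parse_record(c) for c in text.split('\n\n') if c.strip()]

If the format really is this line-oriented, maybe regexes are enough
after all.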
>The choice of an appropriate tool, and hints about possible algorithms, depend on this.
I hope this helps. I spent quite some time with pyparsing, but I was
never able to express the rules of my grammar based on the examples on
the website.
--
yours,
William