[Tutor] Extract strings from a text file

Paul McGuire ptmcg at austin.rr.com
Fri Feb 27 16:01:44 CET 2009


Emad wrote: 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Since I'm learning Pyparsing, this was a nice exercise. I've written this
elementary script, which does the job well in light of the data we have:

from pyparsing import *
ID_TAG = Literal("<ID>")
FULL_NAME_TAG1 = Literal("<Full")
FULL_NAME_TAG2 = Literal("name>")
END_TAG = Literal("</")
word = Word(alphas)
pattern1 = ID_TAG + word + END_TAG
pattern2 = FULL_NAME_TAG1 + FULL_NAME_TAG2 + OneOrMore(word) + END_TAG
result = pattern1 | pattern2

with open("lines.txt") as lines:  # This is your file name
    for line in lines:
        myresult = result.searchString(line)
        if myresult:
            print(myresult[0])


# This prints out
['<ID>', 'Joseph', '</']
['<Full', 'name>', 'Joseph', 'Smith', '</']

# You can access the individual elements of the lists to pick whatever you
# want.

Emad -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Welcome to the world of pyparsing!  Your program is a very good first cut at
this problem.  Let me add some suggestions (more like hints toward more
advanced concepts in your pyparsing learning):
- Look into Group, as in Group(OneOrMore(word)); this adds organization
and structure to the returned results.
- Results names will make it easier to access the separate parsed fields.
- Check out the makeHTMLTags and makeXMLTags helper methods - these do more
than just wrap angle brackets around a tag name: they also handle attributes
in varying order, case variability, and (of course) varying whitespace. The
OP didn't explicitly say this is XML data, but the sample does look suspicious.
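
As a rough sketch of those three hints (not a definitive rewrite), the script
above might be reworked along the following lines. The sample string is only
an assumption about what the input looks like, since the original data file
is not shown in this thread, and the results names "id" and "full_name" are
made up for illustration.

```python
from pyparsing import Group, OneOrMore, Suppress, Word, alphas, makeXMLTags

# Assumed sample data; the real input file is not shown in the thread.
sample = "<ID>Joseph</ID>\n<Full name>Joseph Smith</Full name>"

word = Word(alphas)

# Suppress drops the tag text from the results. Group keeps the name words
# together in one sub-list, and the results names ("id", "full_name") label
# each parsed field so it can be fetched by name instead of by position.
id_pattern = Suppress("<ID>") + word("id") + Suppress("</")
name_pattern = (Suppress("<Full") + Suppress("name>")
                + Group(OneOrMore(word))("full_name") + Suppress("</"))

id_match = id_pattern.searchString(sample)[0]
name_match = name_pattern.searchString(sample)[0]

print(id_match["id"])                 # Joseph
print(list(name_match["full_name"]))  # ['Joseph', 'Smith']

# makeXMLTags builds the opening and closing tag expressions together and
# tolerates extra whitespace (and attributes) inside the opening tag.
id_start, id_end = makeXMLTags("ID")
xml_pattern = id_start + word("id") + id_end
xml_match = xml_pattern.searchString("<ID >Joseph</ID>")[0]

print(xml_match["id"])                # Joseph
```

The "Full name" tag is left to the hand-written literals here because a tag
name containing a space is not well-formed XML, so the tag helpers are only
shown on the ID tag.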

If you only easy_install'ed pyparsing or used the binary Windows installer,
please go back to SourceForge and download the source .ZIP or tarball
package - these have the full examples and htmldoc directories that the
auto-installers omit.

Good luck in your continued studies!
-- Paul


