How to efficiently extract information from structured text file

Jonathan Gardner jgardner at jonathangardner.net
Thu Feb 18 02:13:23 CET 2010


On Feb 16, 3:48 pm, Imaginationworks <xiaju... at gmail.com> wrote:
> Hi,
>
> I am trying to read object information from a text file (approx.
> 30,000 lines) with the following format; each line corresponds to a
> line in the text file.  Currently, the whole file is read into a
> list of strings using readlines(), and a for loop then searches for
> "= {" and "};" to determine the Object, SubObject, and SubSubObject.
> My questions are:
>
> 1) Is there an efficient method to search the whole string list
> to find the locations of the tokens (such as '= {' or '};')?
>
> 2) Are there any efficient ways to extract the object information
> that you may suggest?

Parse it!

Go full-bore with a real parser. You may want to consider one of the
many fine Pythonic implementations of modern parsers, or break out
more traditional parsing tools.

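This format is nested, which means plain regexes can't reliably pull
out what you want: matching each "= {" to its own "};" at arbitrary
depth requires keeping track of nesting, and that is a parser's job.
You're going to need a real, full-bore, no-holds-barred parser for this.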
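
As a minimal sketch of what such a parser can look like, here is a
hand-rolled, stack-based one. The exact file format is not shown in this
message, so the layout below is an assumption: an object opens on a line
ending in "= {", closes on a line consisting of "};", and every other
non-blank line is treated as a field of the innermost open object. The
name parse_objects and the dict layout are hypothetical, chosen only for
illustration.

```python
def parse_objects(lines):
    """Build a nested tree from lines, using a stack of open objects.

    Assumed format (not confirmed by the original post):
      - "Name = {" opens an object,
      - "};" closes the innermost open object,
      - any other non-blank line is a field of the innermost object.
    """
    root = {"name": None, "fields": [], "children": []}
    stack = [root]  # stack[-1] is always the innermost open object

    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith("= {"):
            # Open a new object nested inside the current one.
            node = {"name": line[:-3].strip(), "fields": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif line == "};":
            # Close the innermost object; the root itself is never closed.
            if len(stack) == 1:
                raise ValueError("line %d: unmatched '};'" % lineno)
            stack.pop()
        else:
            stack[-1]["fields"].append(line)

    if len(stack) != 1:
        raise ValueError("unclosed object %r" % stack[-1]["name"])
    return root


# Made-up sample input in the assumed format.
sample = """\
Object1 = {
    color = red;
    SubObject1 = {
        size = 3;
        SubSubObject1 = {
            weight = 7;
        };
    };
};
""".splitlines()

tree = parse_objects(sample)
```

This makes a single pass over the lines, so 30,000 lines should not be
a problem. The stack is what a lone regex lacks: it remembers which
object each "};" belongs to. For a format with more syntax than this,
a parser library is likely the better route.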

Don't worry: the road is not easy, but the destination is absolutely
worth it.

Once you come to appreciate and understand parsing, you have earned
the right to call yourself a red-belt programmer. To get your black
belt, you'll need to write your own compiler. Having mastered these
two tasks, there is no problem you cannot tackle.

And once you realize that every program is really a compiler, then you
have truly mastered the Zen of Programming in Any Programming Language
That Will Ever Exist.

With this understanding, you will judge a programming language's
utility based solely on how hard it is to write a compiler in it, and
its complexity based on how hard it is to write a compiler for it.
(Notice that there are quite a few parsers written in Python, and that
Jython, PyPy, and others have been written for Python!)


