How to efficiently extract information from structured text file

Imaginationworks xiajunyi at gmail.com
Wed Feb 17 09:35:57 EST 2010


On Feb 16, 7:14 pm, Gary Herron <gher... at islandtraining.com> wrote:
> Imaginationworks wrote:
> > Hi,
>
> > I am trying to read object information from a text file (approx.
> > 30,000 lines) with the following format; each line below corresponds
> > to a line in the text file.  Currently, the whole file is read into a
> > list of strings using readlines(), and a for loop then searches for
> > "= {" and "};" to determine the Object, SubObject, and SubSubObject.
> > My questions are:
>
> > 1) Is there an efficient method for searching the whole string
> > list to find the locations of the tokens (such as '= {' or '};')?
>
> Yes.   Read the *whole* file into a single string using the file.read()
> method, and then search through the string using string methods (for
> simple things) or re, the regular expression module (for more
> complex searches).
>
> Note:  There is a point where a file becomes large enough that reading
> the whole file into memory at once (either as a single string or as a
> list of strings) is foolish.    However, 30,000 lines doesn't push that
> boundary.
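
A minimal sketch of the approach Gary describes above, searching one
string for the two tokens. The sample text, the find_all() helper and
the regular expression are illustrative assumptions, not anything from
the real file; an actual run would fill `text` with open(path).read().

```python
import re

# Hypothetical sample in the format from the original post. A real run
# would use: text = open(path).read()
text = """Object1 = {
attr = 1
SubObject1 = {
attr = 2
};
};
"""

def find_all(haystack, token):
    """Return every offset at which token occurs, using plain str.find."""
    positions = []
    start = haystack.find(token)
    while start != -1:
        positions.append(start)
        start = haystack.find(token, start + len(token))
    return positions

# Simple case: string methods are enough to locate the tokens.
open_positions = find_all(text, "= {")
close_positions = find_all(text, "};")

# Same search with re, which can also capture the name before "= {".
# Assumes a name is a run of word characters at the start of a line.
opener = re.compile(r"^[ \t]*(\w+)[ \t]*=[ \t]*\{", re.MULTILINE)
block_names = [m.group(1) for m in opener.finditer(text)]
```

The offsets index into the single string, so text[pos:pos + 3] gives
back "= {" for each entry in open_positions.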
>
> > 2) Are there any efficient ways to extract the object information
> > that you can suggest?
>
> Again, the re module has nice ways to find a pattern and parse out
> pieces of it.   Building a good regular expression takes time,
> experience, and a bit of black magic...    To do so for this case, we
> might need more knowledge of your format.  Also regular expressions have
> their limits.  For instance, if the sub objects can nest to any level,
> then in fact, regular expressions alone can't solve the whole problem,
> and you'll need a more robust parser.
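
For the nesting case mentioned above, one possible shape for a "more
robust parser" is a line-by-line scan that keeps a stack of open blocks.
This is only a sketch under stated assumptions: every opener looks like
"Name = {" on its own line, every closer is "};" on its own line, and
all other lines are kept as raw strings. parse_blocks() and the
dictionary layout are hypothetical names chosen for illustration.

```python
def parse_blocks(text):
    """Parse 'Name = {' ... '};' blocks nested to any depth into a tree."""
    root = {"name": None, "lines": [], "children": []}
    stack = [root]  # stack[-1] is the innermost block still open
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line:
            continue
        if line.endswith("= {"):
            # Opener: attach a new node to the current block and descend.
            node = {"name": line[:-3].strip(), "lines": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif line == "};":
            # Closer: return to the enclosing block.
            if len(stack) == 1:
                raise ValueError("unmatched '};' on line %d" % lineno)
            stack.pop()
        else:
            # Anything else is content of the current block.
            stack[-1]["lines"].append(line)
    if len(stack) != 1:
        raise ValueError("unclosed block %r" % stack[-1]["name"])
    return root

# Hypothetical sample with three levels of nesting.
sample = """Object1 = {
a = 1
SubObject1 = {
b = 2
SubSubObject1 = {
c = 3
};
};
SubObject2 = {
};
};
"""

tree = parse_blocks(sample)
```

Because the depth lives in the stack rather than in the pattern, the
same loop handles SubSubSubObjects and deeper without any changes.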
>
> > Thanks,
>
> > - Jeremy
>
> > ===== Structured text file =================
> > Object1 = {
>
> > ...
>
> > SubObject1 = {
> > ....
>
> > SubSubObject1 = {
> > ...
> > };
> > };
>
> > SubObject2 = {
> > ....
>
> > SubSubObject21 = {
> > ...
> > };
> > };
>
> > SubObjectN = {
> > ....
>
> > SubSubObjectN = {
> > ...
> > };
> > };
> > };
>
>

Gary and Rhodri, thank you for the suggestions.


