How to efficiently extract information from structured text file
Gary Herron
gherron at islandtraining.com
Tue Feb 16 20:14:43 EST 2010
Imaginationworks wrote:
> Hi,
>
> I am trying to read object information from a text file (approx.
> 30,000 lines) with the following format, each line corresponds to a
> line in the text file. Currently, the whole file was read into a
> string list using readlines(), then use for loop to search the "= {"
> and "};" to determine the Object, SubObject,and SubSubObject. My
> questions are
>
> 1) Is there any efficient method that I can search the whole string
> list to find the location of the tokens(such as '= {' or '};'
>
Yes. Read the *whole* file into a single string using the file.read()
method, then search that string with string methods (for simple
things) or with re, the regular expression module (for more complex
searches).
Note: There is a point where a file becomes large enough that reading
the whole file into memory at once (either as a single string or as a
list of strings) is foolish. However, 30,000 lines doesn't push that
boundary.
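A minimal sketch of both kinds of search, assuming the tokens appear
literally as '= {' and '};'. The sample string and the helper name
find_all are made up for illustration; in practice the text would come
from something like open(path).read().

```python
import re

# Hypothetical sample standing in for the contents returned by file.read().
text = """Object1 = {
    SubObject1 = {
        SubSubObject1 = {
        };
    };
};
"""


def find_all(haystack, token):
    """Return the start offset of every occurrence of token, using str.find."""
    positions = []
    start = haystack.find(token)
    while start != -1:
        positions.append(start)
        # Resume the search just past the match we found.
        start = haystack.find(token, start + len(token))
    return positions


# Simple case: plain string methods.
open_positions = find_all(text, "= {")
close_positions = find_all(text, "};")

# Same idea with re: one pass that reports both tokens in file order.
token_re = re.compile(r"= \{|\};")
tokens = [(m.start(), m.group()) for m in token_re.finditer(text)]
```

The re version is handy when the order of opening and closing tokens
matters, since a single pass returns them interleaved by position.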
> 2) Is there any efficient ways to extract the object information you
> may suggest?
>
Again, the re module has nice ways to find a pattern and parse out
pieces of it. Building a good regular expression takes time,
experience, and a bit of black magic... To do so for this case, we
might need more knowledge of your format. Also, regular expressions have
their limits. For instance, if the sub objects can nest to any level,
then regular expressions alone can't solve the whole problem,
and you'll need a more robust parser.
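As a rough sketch of what such a parser could look like, here is a
stack-based version that uses re only to recognize individual lines.
It rests on assumptions about the format that the sample does not
confirm: every block opens on its own line as 'Name = {', closes on
its own line as '};', and everything else is opaque content. The
names OPEN_RE, CLOSE_RE, and parse_blocks are made up for this example.

```python
import re

# Assumed line shapes; adjust these if the real format differs.
OPEN_RE = re.compile(r"^\s*(\w+)\s*=\s*\{\s*$")
CLOSE_RE = re.compile(r"^\s*\};\s*$")


def parse_blocks(text):
    """Parse nested 'Name = { ... };' blocks into a list of dicts.

    Each dict has 'name', 'lines' (non-blank content lines that belong
    directly to the block), and 'children' (nested blocks, in order).
    """
    root = {"name": None, "lines": [], "children": []}
    stack = [root]  # stack[-1] is the block currently being filled
    for lineno, line in enumerate(text.splitlines(), 1):
        m = OPEN_RE.match(line)
        if m:
            node = {"name": m.group(1), "lines": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif CLOSE_RE.match(line):
            if len(stack) == 1:
                raise ValueError(f"line {lineno}: unmatched '}};'")
            stack.pop()
        elif line.strip():
            stack[-1]["lines"].append(line.strip())
    if len(stack) != 1:
        raise ValueError(f"unclosed block: {stack[-1]['name']}")
    return root["children"]
```

Because the stack tracks the current depth, this handles nesting to
any level, which is the part a single regular expression can't do.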
> Thanks,
>
> - Jeremy
>
>
>
> ===== Structured text file =================
> Object1 = {
>
> ...
>
> SubObject1 = {
> ....
>
> SubSubObject1 = {
> ...
> };
> };
>
> SubObject2 = {
> ....
>
> SubSubObject21 = {
> ...
> };
> };
>
> SubObjectN = {
> ....
>
> SubSubObjectN = {
> ...
> };
> };
> };
>