How to efficiently extract information from structured text file

Gary Herron gherron at islandtraining.com
Tue Feb 16 20:14:43 EST 2010


Imaginationworks wrote:
> Hi,
>
> I am trying to read object information from a text file (approx.
> 30,000 lines) with the format shown below, where each line corresponds
> to a line in the text file.  Currently, the whole file is read into a
> list of strings using readlines(), and a for loop then searches for
> "= {" and "};" to determine the Object, SubObject, and SubSubObject. My
> questions are:
>
> 1) Is there an efficient method to search the whole string
> list to find the locations of the tokens (such as '= {' or '};')?
>   

Yes.  Read the *whole* file into a single string using the file.read()
method, and then search through the string using string methods (for
simple things) or re, the regular expression module (for more
complex searches).

Note:  There is a point where a file becomes large enough that reading 
the whole file into memory at once (either as a single string or as a 
list of strings) is foolish.    However, 30,000 lines doesn't push that 
boundary.
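
As a rough sketch of the idea (the sample text and the helper name
find_tokens are made up for illustration; in practice the string would
come from something like open(path).read()):

```python
import re

# Hypothetical sample standing in for the result of open(path).read().
text = """Object1 = {
SubObject1 = {
};
};
"""

# Simple case: a plain string method returns the offset of the first match.
first_open = text.find("= {")

def find_tokens(text):
    """Return (offset, token) pairs for every '= {' or '};' in text."""
    # One alternation finds both tokens in a single pass over the string.
    pattern = re.compile(r"= \{|\};")
    return [(m.start(), m.group()) for m in pattern.finditer(text)]

tokens = find_tokens(text)
```

Each pair gives the character offset of a token in the single string, so
no per-line loop is needed to locate them.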
> 2) Are there any efficient ways to extract the object information that
> you would suggest?
>   

Again, the re module has nice ways to find a pattern and parse
out pieces of it.  Building a good regular expression takes time,
experience, and a bit of black magic...  To do so for this case, we
might need more knowledge of your format.  Also, regular expressions have
their limits.  For instance, if the sub-objects can nest to any level,
then regular expressions alone can't solve the whole problem,
and you'll need a more robust parser.
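
One possible shape for such a parser is sketched below.  It is only a
sketch under assumptions about the format that the post does not
confirm: a line of the form "Name = {" opens an object, a line
containing only "};" closes the innermost open one, and every other
non-blank line belongs to the current object.  The names parse_objects,
OPEN_RE and CLOSE_RE are illustrative.  A small regular expression
recognizes each line, and a stack keeps track of the nesting:

```python
import re

# Assumed line shapes (not confirmed by the original post):
#   "Name = {" opens an object, "};" closes the innermost open one,
#   anything else non-blank is a content line of the current object.
OPEN_RE = re.compile(r"^\s*(\w+)\s*=\s*\{\s*$")
CLOSE_RE = re.compile(r"^\s*\};\s*$")

def parse_objects(text):
    """Parse nested 'Name = { ... };' blocks into a tree of dicts."""
    root = {"name": None, "lines": [], "children": []}
    stack = [root]  # stack[-1] is the innermost object still open
    for raw in text.splitlines():
        opened = OPEN_RE.match(raw)
        if opened:
            node = {"name": opened.group(1), "lines": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif CLOSE_RE.match(raw):
            if len(stack) == 1:
                raise ValueError("unmatched '};'")
            stack.pop()
        elif raw.strip():
            stack[-1]["lines"].append(raw.strip())
    if len(stack) != 1:
        raise ValueError("unclosed object: %s" % stack[-1]["name"])
    return root

# Hypothetical sample data, nested three levels deep.
sample = """Object1 = {
a = 1
SubObject1 = {
b = 2
SubSubObject1 = {
c = 3
};
};
};
"""
tree = parse_objects(sample)
```

Because the stack can grow to any depth, this handles arbitrary nesting,
which a single regular expression cannot.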


> Thanks,
>
> - Jeremy
>
>
>
> ===== Structured text file =================
> Object1 = {
>
> ...
>
> SubObject1 = {
> ....
>
> SubSubObject1 = {
> ...
> };
> };
>
> SubObject2 = {
> ....
>
> SubSubObject21 = {
> ...
> };
> };
>
> SubObjectN = {
> ....
>
> SubSubObjectN = {
> ...
> };
> };
> };
>   



