Chris<div><br></div><div>This block of code made my day - especially yummydataaddrs &amp; &quot;here&#39;s your stupid data&quot;</div><div><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; ">

<span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; ">for start,end in yummydataaddrs:<br></span><span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; ">   fd.seek(start)<br>

</span><span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; ">   print &quot;here&#39;s your stupid data:&quot;, fd.read(end-start+1)</span></blockquote><div><br>

</div><div>Nothing is more impressive than solid code with a good sense of humor.</div><div><br></div><div>Thanks for the code. Since I am in a time crunch, this approach might get me what I need more quickly.</div>

<div><br></div><div>Thanks also for Knuth&#39;s awesome quote. It reminded me of my Stanford friend, who told me that Prof. Knuth still holds a Christmas Tree Lecture every year. Unfortunately, in spite of being in the Bay Area this year, I missed it :(</div>

<div><a href="http://stanford-online.stanford.edu/seminars/knuth/101206-knuth-500.asx">http://stanford-online.stanford.edu/seminars/knuth/101206-knuth-500.asx</a></div><div><br></div><div>Thanks a ton</div><div><br></div>

<div>cheers</div><div>ashish</div><div><br></div><div>p.s. To everybody</div><div><br></div><div>OT (off-topic): I moved to the Bay Area recently and am passionate about technology in general, and about Linux, Python, C, embedded, mobile, and wireless stuff in particular.</div>

<div>I was wondering whether any of you are part of a Bay Area Python (or other tech) meetup that gets together in person for tech talks, discussions, brainstorming, or hack nights.</div><div>If so, I would love to know more and be a part of it.</div>

<br><div class="gmail_quote">On Mon, Dec 20, 2010 at 9:27 PM, Chris Fuller <span dir="ltr">&lt;<a href="mailto:cfuller084@thinkingplanet.net">cfuller084@thinkingplanet.net</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<br>

This isn&#39;t XML, it&#39;s an abomination of XML.  Best to not treat it as XML.<br>

Good thing you&#39;re only after one class of tags.  Here&#39;s what I&#39;d do.  I&#39;ll<br>

give a general solution, but there are two parameters / four cases that could<br>

make the code simpler; I&#39;ll just point them out at the end.<br>

<br>

Iterate over the file descriptor, reading in line-by-line.  This will be slow<br>

on a huge file, but probably not so bad if you&#39;re only doing it once.  It makes<br>

the rest easier.  Knuth has some sage advice on this point (*) :)  Some<br>

feedback on progress to the user can be helpful here, if it is slow.<br>

<br>

Keep track of your offset into the file.  There are two ways: use the tell()<br>

method of the file descriptor (but you will have to subtract the length of the<br>

current line), or just add up the line lengths as you process them.<br>

<br>
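A minimal sketch of the second approach (adding up line lengths), written for Python 3.<br>
It assumes the file is opened in binary mode, so that len(line) counts bytes and the<br>
running total can be passed straight to seek().  In text mode, newline translation or<br>
decoding can make character counts differ from byte offsets.  The helper name<br>
line_offsets and the sample data are made up for illustration.<br>

```python
import io


def line_offsets(fd):
    """Yield (byte_offset, line) pairs for a file object opened in binary mode."""
    fileoff = 0
    for line in fd:
        yield fileoff, line
        # The line includes its trailing newline, so the sum stays exact.
        fileoff += len(line)


# io.BytesIO stands in for open(path, 'rb') so the sketch is self-contained.
sample = io.BytesIO(b"first\nsecond line\nthird\n")
offsets = [off for off, _ in line_offsets(sample)]

# Each recorded offset can be used with seek() to jump back to that line.
sample.seek(offsets[1])
second = sample.readline()
```

Summing the lengths avoids calling tell() while iterating, which is not dependable<br>
because the file object reads ahead.<br>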

Scan each line for the open tag.  Add the tag&#39;s offset within the line to the<br>

file offset of the current line, and push that onto a stack.  Scan for the end tag;<br>

when you find one, pop an address from the stack, and put the two (start/end)<br>

addresses in a list for later.  Keep doing this until you run out of file.<br>

<br>

Now, take that list, and pull off the address-pairs; seek() and read() them<br>

directly.  Lather, rinse, repeat.<br>

<br>

Some off-the-cuff untested code:<br>

<br>

stk = []<br>

yummydataaddrs = []<br>

<br>

fileoff = 0<br>

<br>

fd = open(&#39;ginormous.xml&#39;, &#39;r&#39;)<br>

for line in fd:<br>

    # find() returns -1 when the tag is absent; index() would raise ValueError<br>

    lineoff = line.find(start_tag)<br>

    if lineoff != -1:<br>

        stk.append(fileoff+lineoff)<br>

<br>

    lineoff = line.find(end_tag)<br>

    if lineoff != -1:<br>

        yummydataaddrs.append( (stk.pop(-1), fileoff+lineoff) )<br>

<br>

    fileoff += len(line)<br>

<br>

for start,end in yummydataaddrs:<br>

    fd.seek(start)<br>

    print &quot;here&#39;s your stupid data:&quot;, fd.read(end-start+1)<br>

<br>

<br>

You can simplify a bit if the tags are on a line by themselves, since you<br>

don&#39;t have to keep track of the offset within the line of the tag.  The other<br>

simplification is if they aren&#39;t nested.  You don&#39;t need to mess around with a<br>

stack in this case.<br>

<br>
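A rough sketch of the non-nested simplification, written for Python 3 and assuming<br>
binary mode so that offsets are byte offsets.  A single variable holding the currently<br>
open tag takes the place of the stack.  The function name find_spans, the tag names,<br>
and the sample data are hypothetical.  Unlike the code above, it records the end offset<br>
just past the closing tag, so each span covers the whole element.<br>

```python
import io


def find_spans(fd, start_tag, end_tag):
    """Return (start, end) byte offsets of non-nested start_tag...end_tag spans.

    start is the offset of the first byte of start_tag; end is the offset
    just past the last byte of end_tag.
    """
    spans = []
    start = None   # offset of the currently open tag, or None if none is open
    fileoff = 0    # byte offset of the current line within the file
    for line in fd:
        pos = 0
        while True:
            if start is None:
                i = line.find(start_tag, pos)
                if i == -1:
                    break
                start = fileoff + i
                pos = i + len(start_tag)
            else:
                i = line.find(end_tag, pos)
                if i == -1:
                    break
                pos = i + len(end_tag)
                spans.append((start, fileoff + pos))
                start = None
        fileoff += len(line)
    return spans


# io.BytesIO stands in for open('ginormous.xml', 'rb').
data = b"<junk>\n<item>alpha</item>\n<item>\nbeta\n</item><x/>\n"
fd = io.BytesIO(data)
spans = find_spans(fd, b"<item>", b"</item>")

chunks = []
for start, end in spans:
    fd.seek(start)
    chunks.append(fd.read(end - start))
```

The inner loop also handles several tags on one line, which the one-find-per-line<br>
version does not.<br>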

<br>

(*) &quot;Premature optimization is the root of all evil.&quot;<br>

<br>

<br>

Cheers<br>

<div><div></div><div class="h5">_______________________________________________<br>

Tutor maillist  -  <a href="mailto:Tutor@python.org">Tutor@python.org</a><br>

To unsubscribe or change subscription options:<br>

<a href="http://mail.python.org/mailman/listinfo/tutor" target="_blank">http://mail.python.org/mailman/listinfo/tutor</a><br>

</div></div></blockquote></div><br><br clear="all"><br><i>&quot;We act as though comfort and luxury were the chief requirements of life, when all that we need to make us happy is something to be enthusiastic about.&quot; <br>

-- Albert Einstein</i><br>

</div>