Chris<div><br></div><div>This block of code made my day - especially yummydataaddrs & "here's your stupid data"</div><div><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; ">
<span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; ">for start,end in yummydataaddrs:<br></span><span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; "> fd.seek(start)<br>
</span><span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; "> print "here's your stupid data:", fd.read(end-start+1)</span></blockquote><div><br>
</div><div>Nothing is more impressive than solid code with a good sense of humor. </div><div><br></div><div>Thanks for the code. Since I am in a time crunch, this approach might get me what I need more quickly.</div>
<div><br></div><div>Thanks also for Knuth's awesome quote. It reminded me of my Stanford friend, who told me that Prof. Knuth still holds a Christmas tree lecture every year. Unfortunately, in spite of being in the Bay Area this year, I missed it :(</div>
<div><a href="http://stanford-online.stanford.edu/seminars/knuth/101206-knuth-500.asx">http://stanford-online.stanford.edu/seminars/knuth/101206-knuth-500.asx</a></div><div><br></div><div>Thanks a ton</div><div><br></div>
<div>cheers</div><div>ashish</div><div><br></div><div>p.s. To everybody</div><div><br></div><div>OT (off-topic): I moved to the Bay Area recently & am passionate about technology in general & Linux, Python, C, embedded, mobile, wireless stuff, ...</div>
<div>I was wondering if any of you are part of a Bay Area Python (or other tech) meetup, as in meeting in person for tech talks, discussions, brainstorming, or hack nights?</div><div>If yes, I would love to know more & be a part of it.</div>
<br><div class="gmail_quote">On Mon, Dec 20, 2010 at 9:27 PM, Chris Fuller <span dir="ltr">&lt;<a href="mailto:cfuller084@thinkingplanet.net">cfuller084@thinkingplanet.net</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
This isn't XML, it's an abomination of XML. Best to not treat it as XML.<br>
Good thing you're only after one class of tags. Here's what I'd do. I'll<br>
give a general solution, but there are two parameters (four cases) that could<br>
make the code simpler; I'll point them out at the end.<br>
<br>
Iterate over the file descriptor, reading in line-by-line. This will be slow<br>
on a huge file, but probably not so bad if you're only doing it once. It makes<br>
the rest easier. Knuth has some sage advice on this point (*) :) Some<br>
feedback on progress to the user can be helpful here, if it is slow.<br>
<br>
Keep track of your offset into the file. There are two ways: use the tell()<br>
method of the file descriptor (but you will have to subtract the length of the<br>
current line), or just add up the line lengths as you process them.<br>
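<br>
A minimal sketch of both bookkeeping options, assuming Python 3 and a file opened in binary mode so that len(line) counts bytes and tell() stays usable while iterating. The helper names line_offsets and line_offsets_tell are made up for illustration:<br>
<br>
```python
import io


def line_offsets(fd):
    """Yield (offset, line) pairs by adding up the line lengths."""
    offset = 0
    for line in fd:
        yield offset, line
        offset += len(line)


def line_offsets_tell(fd):
    """Yield (offset, line) pairs using tell().

    tell() reports the position just past the line that was read,
    so the length of the current line has to be subtracted.
    """
    for line in fd:
        yield fd.tell() - len(line), line


# io.BytesIO stands in for a file opened with open(path, 'rb').
sample = b"alpha\nbeta\ngamma\n"
by_sum = list(line_offsets(io.BytesIO(sample)))
by_tell = list(line_offsets_tell(io.BytesIO(sample)))
print(by_sum == by_tell)
```
<br>
Adding up the lengths does not depend on how the file object buffers its reads, which is one reason to prefer it.<br>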
<br>
Scan each line for the open tag. Add the tag's offset within the line to the<br>
line's offset within the file, and push that onto a stack. Scan for the end tag;<br>
when you find one, pop an address from the stack, and put the two (start/end)<br>
addresses in a list for later. Keep doing this until you run out of file.<br>
<br>
Now, take that list, and pull off the address-pairs; seek() and read() them<br>
directly. Lather, rinse, repeat.<br>
<br>
Some off-the-cuff untested code:<br>
<br>
stk = []<br>
yummydataaddrs = []<br>
<br>
fileoff = 0<br>
<br>
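# start_tag and end_tag hold the literal tag strings you are after<br>
fd = open('ginormous.xml', 'r')<br>
for line in fd:<br>
    # find() returns -1 when the tag is absent; index() would raise ValueError<br>
    lineoff = line.find(start_tag)<br>
    if lineoff != -1:<br>
        stk.append(fileoff+lineoff)<br>
<br>
    lineoff = line.find(end_tag)<br>
    if lineoff != -1:<br>
        yummydataaddrs.append( (stk.pop(-1), fileoff+lineoff) )<br>
<br>
    fileoff += len(line)<br>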
<br>
for start,end in yummydataaddrs:<br>
    fd.seek(start)<br>
    print "here's your stupid data:", fd.read(end-start+1)<br>
<br>
<br>
You can simplify a bit if the tags are on a line by themselves, since you<br>
don't have to keep track of the offset of the tag within the line. The other<br>
simplification is if they aren't nested. You don't need to mess around with a<br>
stack in this case.<br>
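<br>
As a rough illustration of the non-nested case, here is a sketch in Python 3 that remembers only the most recent open tag instead of keeping a stack. The function name find_spans, the sample data and the tag name are invented for the example. It also differs from the code above in two ways that are choices of this sketch: the file is read in binary mode, and the end address points just past the close tag, so the read length is end - start:<br>
<br>
```python
import io


def find_spans(fd, start_tag, end_tag):
    """Return (start, end) byte offsets for non-nested tag pairs.

    start is the offset of the open tag; end is the offset just past
    the close tag. Assumes tags never nest and never straddle a line
    break.
    """
    spans = []
    fileoff = 0    # offset of the current line within the file
    start = None   # offset of the open tag still waiting for its close
    for line in fd:
        pos = 0
        while True:
            if start is None:
                i = line.find(start_tag, pos)
                if i == -1:
                    break
                start = fileoff + i
                pos = i + len(start_tag)
            else:
                i = line.find(end_tag, pos)
                if i == -1:
                    break
                pos = i + len(end_tag)
                spans.append((start, fileoff + pos))
                start = None
        fileoff += len(line)
    return spans


# io.BytesIO stands in for a file opened with open(path, 'rb').
sample = b"junk\n<data>one</data> noise <data>two\nlines</data>\n"
fd = io.BytesIO(sample)
spans = find_spans(fd, b"<data>", b"</data>")

chunks = []
for start, end in spans:
    fd.seek(start)
    chunks.append(fd.read(end - start))
print(chunks)
```
<br>
The inner loop handles several tag pairs on one line; if each tag sits on a line by itself, it collapses to a single find() per line.<br>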
<br>
<br>
(*) "Premature optimization is the root of all evil."<br>
<br>
<br>
Cheers<br>
<div><div></div><div class="h5">
</div></div></blockquote></div><br><br clear="all"><br><i>"We act as though comfort and luxury were the chief requirements of life, when all that we need to make us happy is something to be enthusiastic about." <br>
-- Albert Einstein</i><br>
</div>