[Tutor] Extracting data from HTML files

Oswaldo Martinez motorolaguy at gmx.net
Fri Dec 30 23:39:30 CET 2005


> From: Danny Yoo <dyoo at hkn.eecs.berkeley.edu>
> 
[...]
> The Regular Expression HOWTO itself is pretty good and talks about some of
> the stuff you've been running into, so here's a link to the base url that
> you may want to look at:
> 
>     http://www.amk.ca/python/howto/regex/


Ah yes, I've been reading that same doc and got confused about the use of the
"r" prefix, I guess.
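
For what it's worth, here is a minimal sketch of what the "r" prefix seems to
change, as far as I understand it. The pattern and the sample text are made up
for illustration and do not come from my actual files.

```python
import re

# Without the r prefix, Python itself turns "\b" into a backspace character
# before the re module ever sees the pattern.
plain = "\bhtml\b"
# With the r prefix, the backslashes survive, so re sees \b (word boundary).
raw = r"\bhtml\b"

print(len(plain))  # backspace + "html" + backspace
print(len(raw))    # backslash, "b", "html", backslash, "b"

sample = "<html> is not xhtml"
plain_hits = re.findall(plain, sample)  # looks for literal backspaces
raw_hits = re.findall(raw, sample)      # looks for the whole word "html"
print(plain_hits)
print(raw_hits)
```

So the raw string only matters for how Python reads the literal; the re module
gets an ordinary string either way.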


[......]
 
> > I also found that on some of the strings I want to extract, when Python
> > reads them using file.read(), there are newline characters and other
> > stuff that doesn't show up in the actual html source.
> 
> Not certain that I understand what you mean there.  Can you show us?
> read() should not adulterate the byte stream that comes out of your
> files.
 
>>> file = open("file1.html")
>>> file.read()
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\r\n"http://www.w3.org/TR/html4/loose.dtd">\r\n<html>\r\n<head>\r\n<!-- Script for select box changes -->\r\n<script type="text/javascript">\r\n
[...]

That's just a snippet from the HTML code. I'm guessing it won't cause any
problems since it's just the newlines from reading the HTML code and not
actually *in* the code.
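
A small sketch to check that guess: the interactive prompt echoes the repr()
of the string, so the line endings stored in the file are spelled out as \r\n,
while print writes them as real line breaks. The string below is a shortened
stand-in for the file contents, not the whole file.

```python
# Shortened stand-in for what file.read() returned (an assumption for
# illustration, not the full file contents).
data = '<html>\r\n<head>\r\n<!-- Script for select box changes -->\r\n'

# repr() is what the interactive prompt echoes: escapes are spelled out.
echoed = repr(data)
print(echoed)

# print() writes the actual characters, so each \r\n ends a line.
print(data)

# splitlines() drops the line endings and gives the lines as an editor
# would show them.
lines = data.splitlines()
print(lines)
```

If that is right, the \r\n pairs are just the line endings of the file, and a
pattern only has to allow for them where a value wraps onto a new line.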

[...]
> As an aside: the problems you're running into are very much why we
> encourage folks not to process HTML with regular expressions: RE's also
> come with their own somewhat-high learning curve.
>
> Good luck to you.

Yes, I'm seeing this right now, hehe... but since all the files I have to
process have the same structure (they were generated by a script), I think it
might be easier to use REs here. Do you have any idea of what other tool I
can use? I took a look at BeautifulSoup, but it seemed a bit overkill and
very much beyond my current Python knowledge.
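
Roughly what I have in mind with REs, as a sketch only. The markup, the class
names and the values below are hypothetical, since the real files are not
shown here; the point is just that one compiled pattern can be reused when
every file has the same layout.

```python
import re

# Hypothetical markup standing in for the script-generated files.
html = (
    '<table>\r\n'
    '<tr><td class="name">Alice</td><td class="score">\r\n42</td></tr>\r\n'
    '<tr><td class="name">Bob</td><td class="score">17</td></tr>\r\n'
    '</table>\r\n'
)

# re.DOTALL lets "." match line endings too, so a value that wraps onto a new
# line is still captured; the non-greedy ".*?" stops at the first </td>.
cell = re.compile(r'<td class="(\w+)">(.*?)</td>', re.DOTALL)

# strip() removes the stray \r\n around a captured value.
pairs = [(name, value.strip()) for name, value in cell.findall(html)]
print(pairs)
```

This would only hold up as long as the generating script keeps producing
exactly the same structure.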

Thanks!


