HTML Parser

Sat Dec 30 22:42:44 EST 2000

On Fri, 29 Dec 2000 10:26:31 -0500, Voitenko, Denis <dvoitenko at qode.com>
wrote:
>I am trying to write an HTML parser. I am starting off with a simple
>one like so:
>
...
>newline=re.compile('\n')
...
>input_file = file.read()
...
>jsp_content = newline.split(input_file)

Two things, neither of which answers your question (others have already
done that...):

First, you don't need to use re to split a file into lines. You could've
just said:

jsp_content = file.readlines()

(note that this, like your existing code, reads the entire file into
memory, which might not be a good idea if your file is huge. Also,
unlike your split, readlines() keeps the trailing newline on each line.)
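
To make the options concrete, here is a minimal sketch. The sample text
is made up, and io.StringIO only stands in for a real open file so the
example is self-contained. It compares readlines() with splitlines(),
and shows that iterating over the file object is one way to avoid
building the whole list in memory:

```python
import io

# Hypothetical sample content; io.StringIO stands in for a real open file.
sample = "<html>\n<body>\n</body>\n</html>\n"

# readlines() keeps the trailing newline on every line.
lines_with_newlines = io.StringIO(sample).readlines()

# splitlines() drops the newlines, which is close to splitting on '\n'
# (minus the empty string that split leaves after a final newline).
lines_without_newlines = sample.splitlines()

# Iterating over the file object yields one line at a time, so the
# lines never have to sit in memory together as a list.
line_count = 0
for line in io.StringIO(sample):
    line_count += 1
```

Which of these fits best depends on whether you need the newlines and
on how large the file is.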

Second (this isn't Python-related), you probably don't want to split
your file into lines in any case. HTML is *not* a line-based language.
The following is a perfectly valid HTML tag:

    <img
        src="foo.jpg"
        width="640"
        height="480"
        alt="Howdy!"
    >

Your code wouldn't work with such tags, since it works line-by-line.
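
One way around this is to feed the whole text to a parser that tracks
tags rather than lines. The sketch below uses the standard library's
html.parser module, which is an assumption on my part about your setup
(it is the Python 3 module; older Pythons shipped different ones).
TagCollector is a hypothetical helper name, not something from your
code:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))


# The multi-line tag from above, passed in as one string.
html_text = """
<img
    src="foo.jpg"
    width="640"
    height="480"
    alt="Howdy!"
>
"""

collector = TagCollector()
collector.feed(html_text)
collector.close()

for tag, attributes in collector.tags:
    print(tag, attributes)
```

Because the parser sees the text as a stream, it does not matter where
the line breaks fall inside the tag.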

-- 
  C. Laurence Gonsalves                "Any sufficiently advanced
  clgonsal at kami.com                     technology is indistinguishable
  http://cryogen.com/clgonsal/          from magic." -- Arthur C. Clarke