html parser , unexpected '<' char in declaration
Jesus Rivero - (Neurogeek)
jrivero at latinux.org
Mon Feb 20 20:01:39 EST 2006
Sakcee wrote:
> html =
> '<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> <head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
> </body></html>'
>
>
html =
"""
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff">
Foo foo , blah blah
</body>
</html>
"""
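Check your HTML; it is malformed. The DOCTYPE declaration is never
closed with '>', which is what the parser is complaining about, and it
belongs before <html>, not inside it. Also, a single-quoted (') string
cannot span multiple lines in Python; use triple quotes, as in the code
above.
As a suggestion, it would help to review the HTML basics ;)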
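If you cannot fix the documents at the source, one option is to repair
the declaration before feeding the text to the parser. The following is
only a rough sketch: close_doctype is a hypothetical helper name, and it
assumes the only problem is a DOCTYPE with a missing '>' that runs
straight into the next tag. It does not handle a DOCTYPE with an internal
subset or any other kind of broken markup.

```python
import re

# A "<!DOCTYPE ..." declaration that reaches the next "<" without ever
# being closed by ">". Group 1 is the declaration text, group 2 is any
# whitespace between it and the following tag.
_UNCLOSED_DOCTYPE = re.compile(r'(<!DOCTYPE[^<>]*?)(\s*)(?=<)', re.IGNORECASE)


def close_doctype(html):
    """Add the missing '>' to the first unterminated DOCTYPE, if any.

    A declaration that is already closed is left untouched, because the
    pattern cannot match across an existing '>'.
    """
    return _UNCLOSED_DOCTYPE.sub(r'\1>\2', html, count=1)
```

You would then call parser.feed(close_doctype(html)) instead of
parser.feed(html).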
Regards
Jesus (Neurogeek)
> >>> import htmllib
> >>> import formatter
> >>> parser = htmllib.HTMLParser(formatter.NullFormatter())
> >>> parser.feed(html)
>
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
>     self.goahead(0)
>   File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
>     k = self.parse_declaration(i)
>   File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
>     self.error(
>   File "/usr/lib/python2.4/htmllib.py", line 40, in error
>     raise HTMLParseError(message)
> htmllib.HTMLParseError: unexpected '<' char in declaration
>
>
> The error is caused by the unclosed DOCTYPE declaration.
>
> What is the best way to handle this kind of document? Should I use a
> regex to check and strip it, or does HTMLParser offer something? Can I
> override the default sgmllib behaviour?
> I have to work with htmllib because of existing modules.
>
>
> thanks
>