html parser , unexpected '<' char in declaration

Mon Feb 20 20:01:39 EST 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sakcee wrote:
> html =
> '<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> <head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
> </body></html>'
> 
> 

html = """
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
    <html>
     <head>
     </head>
     <body bgcolor="#ffffff">
      Foo foo , blah blah
     </body>
    </html>
    """

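Try checking your HTML code; it looks really messy. The DOCTYPE
declaration is missing its closing '>', and it should come before the
<html> tag, not inside it. Also, a plain ' string cannot span multiple
lines in Python; use a triple-quoted string instead. You can try the
code above.

As a suggestion, you should really focus on learning HTML basics ;)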

Regards

Jesus (Neurogeek)

> >>> import htmllib
> >>> import formatter
> >>> parser = htmllib.HTMLParser(formatter.NullFormatter())
> >>> parser.feed(html)
> 
> 
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
>     self.goahead(0)
>   File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
>     k = self.parse_declaration(i)
>   File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
>     self.error(
>   File "/usr/lib/python2.4/htmllib.py", line 40, in error
>     raise HTMLParseError(message)
> htmllib.HTMLParseError: unexpected '<' char in declaration
> 
> 
> The error is caused by the unclosed DOCTYPE declaration.
> 
> What is the best way to handle this kind of document? Should I use a
> regex to check and strip it, or does HTMLParser offer something? Can I
> override the default sgmllib behaviour?
> I have to work with htmllib because of existing modules.
> 
> 
> thanks
> 
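If you cannot fix the documents at the source, one option for the regex
route you mention is to strip the DOCTYPE before feeding the parser. The
following is only a minimal sketch: strip_doctype is a hypothetical
helper name, not something htmllib or sgmllib provides, and the pattern
assumes the declaration contains no '<' or '>' of its own (so a DOCTYPE
with an internal subset would not be handled).

```python
import re

# Hypothetical helper, not part of htmllib/sgmllib.
# Matches '<!DOCTYPE', then everything up to the next '<' or '>',
# then the closing '>' if it happens to be present.
_DOCTYPE_RE = re.compile(r'<!DOCTYPE[^<>]*>?', re.IGNORECASE)

def strip_doctype(html):
    """Return html with any DOCTYPE declaration removed,
    whether or not the declaration was properly closed."""
    return _DOCTYPE_RE.sub('', html)

# The document from the original post, with the unclosed DOCTYPE.
broken = ('<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
          '<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah\n'
          '</body></html>')

cleaned = strip_doctype(broken)
print(cleaned)
```

The cleaned string can then be passed to parser.feed() as in your
session. Because the declaration is gone before the parser sees it, the
parse_declaration call shown in your traceback should no longer be
reached for it, though I have not tried this against every malformed
variant.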

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD+mZzdIssYB9vBoMRAoWXAJ9KuAnLLXhZVv4t6fDBpu3RW6oxFgCeM/1S
iNScofTDdJxLfOkaAR9Ejws=
=+LTo
-----END PGP SIGNATURE-----