getroot() problem

Sun Oct 23 21:22:32 EDT 2011

On 10/23/2011 09:06 PM, 水静流深 wrote:
> C:\Documents and Settings\peng>cd c:\python32
>
>
>
> C:\Python32>python
>
> Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
>
> Type "help", "copyright", "credits" or "license" for more information.
>
> >>> import lxml.html
>
> >>> sfile='http://finance.yahoo.com/q/op?s=A+Options'
>
> >>> root=lxml.html.parse(sfile).getroot()
> There is no problem parsing:
>
> http://finance.yahoo.com/q/op?s=A+Options
>
> Why can't I parse
>
> http://frux.wikispaces.com/ ?
>
> >>> import lxml.html
>
> >>> sfile='http://frux.wikispaces.com/'
>
> >>> root=lxml.html.parse(sfile).getroot()
>
> Traceback (most recent call last):
>
>   File "<stdin>", line 1, in <module>
>
>   File "C:\Python32\lib\site-packages\lxml\html\__init__.py", line 692, in parse
>
>     return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
>
>   File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
>
>   File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
>
>   File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
>
>   File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
>
>   File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
>
>   File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
>
>   File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
>
>   File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71927)
>
> IOError: Error reading file 'b'http://frux.wikispaces.com/'': b'failed to load external entity "http://frux.wikispaces.com/"'
>
>>> >
Double-spacing makes your message much harder to read. In any case, I
can only comment in a general way. Most HTML is malformed and is not
legal HTML. Although I don't have any experience parsing it, I do
with XML, which has similar problems.

The first thing I'd do is separate loading the byte string from the
website from parsing those bytes. Further, I'd make a local copy of
those bytes, so you can test repeatably. For example, you could run
the wget utility to copy the bytes locally and create a file.
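
As a rough sketch of that separation, something like the following
might do, using urllib.request from the standard library in place of
wget. The function names and the local file name are hypothetical, and
this has not been tried against the site in question; it only shows the
two steps kept apart so that a failure points clearly at either the
fetch or the parse.

```python
import urllib.request


def download(url, path):
    """Fetch the raw bytes at url and store them unchanged in path."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(path, "wb") as f:
        f.write(data)
    return data


def parse_saved(path):
    """Parse a previously saved copy; no network access happens here."""
    # Imported inside the function so that download() can be used
    # on its own, even on a machine without lxml installed.
    import lxml.html
    return lxml.html.parse(path).getroot()
```

One would call download('http://frux.wikispaces.com/', 'frux.html')
once, then parse_saved('frux.html') as often as needed. If the first
call raises, the problem lies in fetching the page rather than in lxml.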
-- 

DaveA