getroot() problem
Dave Angel
d at davea.name
Sun Oct 23 21:22:32 EDT 2011
On 10/23/2011 09:06 PM, 水静流深 wrote:
> C:\Documents and Settings\peng>cd c:\python32
>
>
>
> C:\Python32>python
>
> Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
>
> 32
>
> Type "help", "copyright", "credits" or "license" for more information.
>
>>>> import lxml.html
>
>>>> sfile='http://finance.yahoo.com/q/op?s=A+Options'
>
>>>> root=lxml.html.parse(sfile).getroot()
> there is no problem to parse :
>
>
> http://finance.yahoo.com/q/op?s=A+Options'
>
>
>
>
> why i can not parse
>
> http://frux.wikispaces.com/ ??
>
>>>> import lxml.html
>
>>>> sfile='http://frux.wikispaces.com/'
>
>>>> root=lxml.html.parse(sfile).getroot()
>
> Traceback (most recent call last):
>
> File "<stdin>", line 1, in<module>
>
> File "C:\Python32\lib\site-packages\lxml\html\__init__.py", line 692, in parse
>
>
>
> return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
>
> File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:5
>
> 4187)
>
> File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etre
>
> e.c:79485)
>
> File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lx
>
> ml.etree.c:79768)
>
> File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.e
>
> tree.c:78843)
>
> File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/
>
> lxml/lxml.etree.c:75698)
>
> File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDo
>
> c (src/lxml/lxml.etree.c:71739)
>
> File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.e
>
> tree.c:72614)
>
> File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etr
>
> ee.c:71927)
>
> IOError: Error reading file 'b'http://frux.wikispaces.com/'': b'failed to load e
>
> xternal entity "http://frux.wikispaces.com/"'
>
>>> >
Double-spacing makes your message much harder to read. I can only
comment in a general way, in any case. Most HTML is malformed and not
legal HTML. Although I don't have any experience with parsing it, I do
with XML, which has similar problems.
The first thing I'd do is separate loading the byte string from the
website from parsing those bytes. Further, I'd make a local copy of
those bytes, so you can test repeatably. For example, you could run the
wget utility to copy the bytes locally and create a file.
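
If you'd rather stay in Python than use wget, here is a minimal sketch
of that separation. It is untested against the site in question. The
function names and the frux.html file name are made up for illustration;
urllib.request comes with Python 3.2, and lxml.html.parse is the same
call you were already using, just pointed at a local file instead of a
URL.

```python
import urllib.request
from pathlib import Path


def fetch_bytes(url, timeout=30):
    """Download the raw bytes of a page.

    Any network or HTTP error is raised here, by urllib, rather than
    being reported indirectly by the parser.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_local_copy(url, path):
    """Fetch the page once and write the bytes to disk for repeatable tests."""
    data = fetch_bytes(url)
    Path(path).write_bytes(data)
    return data


def parse_local_copy(path):
    """Parse the saved file with lxml, with no network involved."""
    # lxml is third-party; it is imported here so that the fetch step
    # above can be run and checked even where lxml is not installed.
    import lxml.html
    return lxml.html.parse(str(path)).getroot()
```

You would call save_local_copy('http://frux.wikispaces.com/', 'frux.html')
once, then parse_local_copy('frux.html') as often as needed. If the first
call fails, the problem is in fetching the page, not in the HTML; if only
the second fails, you have a saved file you can inspect and test against.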
--
DaveA