
On Wed, Jan 10, 2018 at 08:21:47AM +0100, Pedro Andres Aranda Gutierrez wrote:
I have a program written in Python3 that uses lxml. It parses Web pages and creates ebooks out of them. This is quite handy when transferring bigger manuals to my eReader. If I start from a local copy (i.e. all the files on my hard disk), it works flawlessly - after a lot of tries related to Unicode :-)
But when I try to follow the document structure from a live server (i.e. downloading over HTTP), it fails.
I'm in Python3 and using the following libs:
from http.client import HTTPConnection, HTTPSConnection
import urllib.request, urllib.error, urllib.parse
from urllib.parse import urlparse, urlsplit, urljoin
The main magic is performed with connection.getresponse() and response.read(), response.status and response.data
The error is always:
Error reading file '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
This looks like you're passing the XML contents to lxml.html.parse() instead of calling lxml.html.fromstring()?

Can you show us the actual code?

Marius Gedminas
-- 
Committee, n.: A group of men who individually can do nothing but as a group decide that nothing can be done.
  -- Fred Allen
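
For illustration, here is a minimal sketch of the distinction raised in the reply. It is not the original poster's code, which was not shown: the `page` bytes are a made-up stand-in for whatever response.read() returns, and all variable names are hypothetical. It assumes lxml is installed.

```python
import io

import lxml.html

# Hypothetical stand-in for the bytes returned by response.read().
page = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    b"<html><head><title>Manual</title></head>"
    b"<body><p>Chapter 1</p></body></html>"
)

# parse() expects a filename, a URL, or a file-like object. Given the
# document text itself, lxml treats the text as a path it cannot open.
parse_error = None
try:
    lxml.html.parse(page.decode("utf-8"))
except OSError as exc:
    parse_error = exc  # message starts with "Error reading file '<!DOCTYPE ..."

# fromstring() takes the document contents directly.
root = lxml.html.fromstring(page)
title_from_string = root.findtext(".//title")

# Alternatively, wrap the downloaded bytes in a file-like object for parse().
tree = lxml.html.parse(io.BytesIO(page))
title_from_file = tree.getroot().findtext(".//title")

print(title_from_string, title_from_file)
```

Passing bytes rather than a decoded str to fromstring() lets lxml work out the encoding itself, which may also sidestep some of the Unicode trouble mentioned in the question.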