What is the correct way to avoid the W3C DTD _not_ to be served
Hi folks,

I have a program written in Python 3 that uses lxml. It parses web pages and creates ebooks out of them. This is quite handy when transferring bigger manuals to my eReader. If I start from a local copy (i.e. all the files on my hard disk), it works flawlessly, after a lot of tries related to Unicode :-)

But when I try to follow the document structure from a live server (i.e. download using HTTP), it fails.

I'm on Python 3 and using the following libs:

    from http.client import HTTPConnection, HTTPSConnection
    import urllib.request, urllib.error, urllib.parse
    from urllib.parse import urlparse, urlsplit, urljoin

The main magic is performed with connection.getresponse() and response.read(), response.status and response.data

The error is always:

    Error reading file '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Thanks for any help

--
Questions are not there to be answered, questions are there to be asked.
Georg Kreisler
Pedro Andres Aranda Gutierrez wrote on 10.01.2018 at 08:21:
The error is always:
Error reading file '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
That's not a file path, so I'll assume that you didn't copy the error message exactly. My guess is that your server lacks XML catalogue files.

http://xmlsoft.org/catalog.html
https://stackoverflow.com/questions/7228583/python-lxml-catalog-lookup

BTW, lxml does not currently provide an API for the catalogue support in libxml2 (e.g. extending or directly querying it). Should be easy to implement, though. Pull request welcome.

Stefan
On Wed, Jan 10, 2018 at 08:21:47AM +0100, Pedro Andres Aranda Gutierrez wrote:
The error is always:
Error reading file '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
This looks like you're passing the XML contents to lxml.html.parse() instead of calling lxml.html.fromstring()?

Can you show us the actual code?

Marius Gedminas
--
Committee, n.: A group of men who individually can do nothing but as a group decide that nothing can be done.
  -- Fred Allen
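The difference Marius points at can be shown with a minimal sketch. The original code was never posted, so this is only an illustration: the `page` string is a hypothetical stand-in for whatever response.read() returned. lxml.html.parse() expects a filename, URL, or file-like object, so a string of markup is treated as a path to open, while lxml.html.fromstring() takes the document content itself.

```python
import lxml.html
from lxml import etree

# Hypothetical page body, standing in for the downloaded document.
page = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html><head><title>Manual</title></head>'
    '<body><p>Chapter 1</p></body></html>'
)

# parse() takes the markup for a file name and fails to read it.
# lxml normally reports this as an OSError ("Error reading file ...");
# LxmlError is caught as well in case a given version reports it differently.
try:
    lxml.html.parse(page)
    parse_error = None
except (OSError, etree.LxmlError) as exc:
    parse_error = str(exc)

# fromstring() parses the content directly, DOCTYPE included.
root = lxml.html.fromstring(page)
title = root.findtext(".//title")

print("parse() failed with:", parse_error)
print("fromstring() found title:", title)
```

fromstring() also accepts the bytes returned by response.read(), so the downloaded body can usually be handed over without decoding it first.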
Hi Marius,

I think you gave me the right answer. Thanks a ton.
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
participants (3)
- Marius Gedminas
- Pedro Andres Aranda Gutierrez
- Stefan Behnel