Does lxml run under PyPy, and would it make a difference to my project?
I looked at the PyPy site, which says it can be a lot faster in some circumstances. Would it help in my case?
I run quite primitive lxml scripts across very large data sets: 50,000 Early Modern texts that have been linguistically annotated so that every token is a <w> element with a set of attributes. There are a lot of errors in the original annotation, and I use various heuristics to spot and correct them, which mainly involves changing the @lemma, @pos, and @reg attributes.
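For concreteness, a pass of that kind looks roughly like this (the correction table, the @pos values, and the function name are made up for illustration; my real heuristics are more involved):

```python
from lxml import etree

def fix_lemmas(root, corrections):
    """Rewrite @lemma on every <w> element using an {old: new} correction table."""
    changed = 0
    for w in root.iter('w'):
        lemma = w.get('lemma')
        if lemma in corrections:
            w.set('lemma', corrections[lemma])
            changed += 1
    return changed

# Tiny in-memory document standing in for one of the annotated texts.
root = etree.fromstring('<text><w lemma="bee" pos="vvb"/><w lemma="be" pos="vvb"/></text>')
n = fix_lemmas(root, {'bee': 'be'})
```

In the real scripts the tree is parsed from a file, corrected, and serialized back out, so each text is parsed in full whether or not it contains any of the errors being hunted.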
The texts vary in length from 100 KB to 250 MB. It appears to me that building the document tree is the most expensive operation in the enterprise. If an error has 1,000 occurrences but you don't know which texts they occur in, you have to run the script across the entire set. That is an operation that takes between six and eight hours, so you don't want to run it until you've gathered a lot of errors.
Shaving a quarter off that running time wouldn't make much difference. Cutting it in half would be well worth it.
I haven't experimented with running things concurrently. I use PyCharm and could in theory do two concurrent runs, dividing the texts into two groups of 25,000. I have a Mac with 32 GB of memory and a four-core 4 GHz i7 processor. I don't know enough about the internals of the machine to figure out whether the two processes would just get in each other's way.
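Rather than managing two runs by hand, I gather the standard library could do the splitting for me. A minimal sketch with multiprocessing (the file names and the body of process_file are placeholders, not my actual script):

```python
from multiprocessing import Pool

def process_file(path):
    # Stand-in for: parse the file with lxml, apply corrections, write it back.
    return 'done:' + path

def run_parallel(paths, workers=2):
    """Farm the per-file pass out to a pool of worker processes."""
    with Pool(workers) as pool:
        return pool.map(process_file, paths)
```

Called as run_parallel(list_of_paths, workers=2), this would keep two cores busy without my having to divide the texts into groups myself, if separate processes don't just get in each other's way.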
I'll be grateful for any advice.
Martin Mueller
Professor emeritus of English and Classics
Northwestern University