[lxml-dev] problems accessing style tag text
Hey, I was trying to get at the contents of a style tag parsed with etree.HTML and it doesn't seem to be accessible as the text attribute of the element. This appears to be related to the insertion of the CDATA marker around the text. Maybe there is something obvious that I'm missing. I tried this against a 1.1 release and a recent checkout of the trunk. I appreciate any advice you all can offer. The following demonstrates what I'm seeing:
Thanks, - Luke
Hi, thanks for reporting this and for the excellent test case. Luke Tucker wrote:
The thing is that lxml always sets up the parser to convert CDATA sections to normal text nodes, which simplifies the internal handling of text quite a bit. However, the HTML parser does not have such a parser option, so CDATA sections slip through in this case, especially since the libxml2 HTML parser generates them explicitly for script content and CSS.

I consider this a bug that should be fixed. It looks like we can prevent the CDATA generation by modifying the SAX parser function table of libxml2 in place (set the cdataBlock function to NULL). That's how libxml2 handles the XML_PARSE_NOCDATA flag internally, and it seems to work just fine for HTML.

I'll commit it to both the trunk and the 1.1 branch.

Stefan
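For anyone who wants to check the behaviour on their own installation, here is a minimal sketch of what should work once CDATA generation is suppressed. It is not the original test case from this thread: the sample document and the helper name `style_text` are made up for illustration, and it assumes an lxml version that already contains the fix.

```python
from lxml import etree

# Hypothetical sample document, not the test case posted in this thread.
HTML_SOURCE = (
    "<html><head><style>body { color: red; }</style></head>"
    "<body><p>hi</p></body></html>"
)


def style_text(source):
    """Return the text of the first <style> element, or None if there is none."""
    root = etree.HTML(source)
    style = root.find(".//style")
    if style is None:
        return None
    # With CDATA sections converted to plain text nodes, the CSS is
    # available through the ordinary .text attribute.
    return style.text


if __name__ == "__main__":
    print(style_text(HTML_SOURCE))
```

On an affected version, `.text` would presumably not return the CSS, since the content sits in a CDATA node rather than a normal text node.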
participants (3)
- Luke Tucker
- Martijn Faassen
- Stefan Behnel