Possible bug with lxml.etree.fromstring('<value>]]></value>').text

Hi all, I get an error when I try to parse a text element like this: <value>]]></value>
lxml.etree.fromstring('<value>]]></value>').text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 2756, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54726)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82843)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81641)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78311)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74567)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75458)
  File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74791)
lxml.etree.XMLSyntaxError: Sequence ']]>' not allowed in content, line 1, column 8
I'm not an expert, but from http://www.w3.org/TR/REC-xml/#NT-AttValue:

AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"

which I read as: any character or Reference is valid, except & and <, which are used for escaping and closing the element.

The sequence <value>]]></value> also validates as well-formed at http://www.xmlvalidation.com/

The sequence <value>]></value> parses OK, so the problem only occurs with a double ] followed by >.

It's probably related to parsing <![CDATA[ ... ]]>. I guess that when the parser detects ]]> it assumes or requires that it is inside a <![CDATA[ section, which is of course not the case here.

The sequence <value><![CDATA[foo]]></value> is parsed correctly:

lxml.etree.fromstring('<value><![CDATA[foo]]></value>').text
'foo'
Any ideas whether this is really a bug? - Kees

On 16.08.2013, at 18:52, Kees Bos <k.bos@capitar.com> wrote:

Hi Kees,

> lxml.etree.fromstring('<value>]]></value>').text
> ==> Sequence ']]>' not allowed in content, line 1, column 8
AttValue is about Attributes:
etree.fromstring('<value x="]]>"></value>').get('x')
']]>'
In character data, the > in ]]> must be escaped as &gt; (see section 2.4 of the spec) unless it ends a CDATA section:
etree.fromstring('<value>]]&gt;</value>').text
']]>'
jens
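
For anyone generating such documents from arbitrary strings, the escaping rule above can be sketched as follows. This is only a minimal illustration, not something from the original posts: `wrap_text` is a hypothetical helper name, and the fallback to the standard library parser is an assumption made so the snippet runs even where lxml is not installed. `xml.sax.saxutils.escape` replaces &, < and > in the text, so a literal ]]> never reaches the parser as character data.

```python
from xml.sax.saxutils import escape

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib parser is an acceptable stand-in for this demo.
    import xml.etree.ElementTree as etree


def wrap_text(tag, text):
    """Hypothetical helper: build '<tag>text</tag>' with the text escaped."""
    # escape() turns & into &amp;, < into &lt; and > into &gt;
    return "<%s>%s</%s>" % (tag, escape(text), tag)


doc = wrap_text("value", "]]>")
parsed_text = etree.fromstring(doc).text

print(doc)          # the serialized form, with the > escaped
print(parsed_text)  # the text as seen after parsing
```

Building the tree with the element API and letting the serializer do the escaping would probably be the more robust choice in real code; the string formatting here is only meant to make the escaping step visible.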
participants (2)
- Jens Quade
- Kees Bos