Possible bug with lxml.etree.fromstring('<value>]]></value>').text

Hi all, I get an error when I try to parse a text element like this: <value>]]></value>
I'm not an expert, but from http://www.w3.org/TR/REC-xml/#NT-AttValue:
AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"
which I read as: any Reference character is valid, except & and <, which are used for escaping and closing the element.
The sequence <value>]]></value> also validates as well-formed at http://www.xmlvalidation.com/
The sequence <value>]></value> parses OK (so it only fails with a double ] followed by >).
It's probably related to parsing <![CDATA[ ... ]]> (i.e. I guess that when the parser detects ]]> it assumes or requires that it is inside a <![CDATA[ section, which is of course not the case here).
The sequence <value><![CDATA[foo]]></value> is parsed correctly:
lxml.etree.fromstring('<value><![CDATA[foo]]></value>').text
'foo'
Any ideas whether this is really a bug? - Kees

On 16.08.2013, at 18:52, Kees Bos <k.bos@capitar.com> wrote:
Hi Kees,
> lxml.etree.fromstring('<value>]]></value>').text
> ==> Sequence ']]>' not allowed in content, line 1, column 8
AttValue is about Attributes:
etree.fromstring('<value x="]]>"></value>').get('x')
']]>'
In character data, the > in ]]> must be escaped, e.g. as &gt; (see section 2.4 of the spec), unless it ends a CDATA section:
etree.fromstring('<value>]]&gt;</value>').text
']]>'
jens
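
The cases discussed above can be checked side by side. This is only a minimal sketch, not something posted in the thread: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, on the assumption that both parsers treat these inputs the same way. The helper name parse_text is made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parse_text(xml):
    """Return the root element's text, or None if the input is not well-formed.

    Hypothetical helper for this example only.
    """
    try:
        return ET.fromstring(xml).text
    except ET.ParseError:
        return None


# A literal "]]>" in character data is rejected by the parser.
unescaped = parse_text('<value>]]></value>')

# Escaping the ">" as "&gt;" makes the same content well-formed.
escaped = parse_text('<value>]]&gt;</value>')

# Inside an attribute value the sequence is allowed as-is.
attribute = ET.fromstring('<value x="]]>"></value>').get('x')

# escape() from xml.sax.saxutils produces the escaped form programmatically.
generated = '<value>' + escape(']]>') + '</value>'
round_trip = parse_text(generated)

print(unescaped, escaped, attribute, round_trip)
```

When the document is built through the element API (setting .text and serializing) rather than by string concatenation, the serializer is expected to take care of this escaping itself.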

participants (2): Jens Quade, Kees Bos