Re: [lxml-dev] lxml - exslt - regexp:match()
Michael Zeidler wrote:
regexp:match('123abc567','([0-9]+)([a-z]+)([0-9]+)') does not return an array containing the matched groups. If I set the variable $test with <xsl:variable name="test" select="regexp:match('123abc567','([0-9]+)([a-z]+)([0-9]+)')"/>, I should be able to access the matched groups as $test[0], $test[1], and so on. See http://www.exslt.org/regexp/functions/match/index.html
Hmm, interesting. The page doesn't actually say that this is supposed to work. All they provide is an example with a /single/ group. The result of your test case is not defined.

For comparison, I now implemented the examples from the page as unit tests, which sadly showed that Python's regexps are incompatible with what EXSLT requires. The Python RE "([a-z])+ " does not match "test " as in EXSLT, only the last "t" is returned for the group by re.findall(). So we can't claim compatibility with EXSLT at this point.

Note, though, that I never really said it was compatible, it just builds on Python's re module. I still think that's enough for a Python XML library.

That said, I fixed your use case in the current trunk, as I think it makes sense to expect the result above from such a call. Note, however, that EXSLT dictates that the first element in a non-global RE result (without 'g' flag) must be the entire string that matched, which even fits the semantics of the group() method in Python's MatchObjects. So your $test[0] will contain "123abc567", $test[1] is "123" etc.

Stefan
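As a rough illustration of the result shape described above, here is a minimal sketch using only Python's re module. The helper name match_groups is hypothetical, and the use of re.search() is an assumption; this is not lxml's actual implementation, just the group() semantics the message refers to.

```python
import re

def match_groups(text, pattern):
    """Hypothetical helper: build a non-global match result as a list."""
    m = re.search(pattern, text)  # assumption: search, not anchored match
    if m is None:
        return []
    # group(0) is the entire matched string; groups() holds the
    # captured groups in order (None for groups that did not take part).
    return [m.group(0), *m.groups()]

result = match_groups('123abc567', r'([0-9]+)([a-z]+)([0-9]+)')
print(result)  # ['123abc567', '123', 'abc', '567']
```

The first list entry is the whole match and the captured groups follow it, which is the ordering described for the non-global case.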
Stefan Behnel wrote: [snip]
For comparison, I now implemented the examples from the page as unit tests, which sadly showed that Python's regexps are incompatible with what EXSLT requires. The Python RE "([a-z])+ " does not match "test " as in EXSLT, only the last "t" is returned for the group by re.findall(). So we can't claim compatibility with EXSLT at this point. -- Note, though, that I never really said it was compatible, it just builds on Python's re module. I still think that's enough for a Python XML library.
If it's not compatible, I think it should be invoked differently than in the EXSLT way. This way someone dropping in an EXSLT stylesheet with regexes doesn't have a half-working stylesheet but a completely and clearly failing stylesheet: lxml doesn't support the regexes. In addition, the path forward to getting the stylesheet working is clear: use the Python-based and deliberately incompatible regex facility instead, and rewrite the regexes.

Regards,
Martijn
Hi Martijn,

Martijn Faassen wrote:
Stefan Behnel wrote: [snip]
For comparison, I now implemented the examples from the page as unit tests, which sadly showed that Python's regexps are incompatible with what EXSLT requires. The Python RE "([a-z])+ " does not match "test " as in EXSLT, only the last "t" is returned for the group by re.findall(). So we can't claim compatibility with EXSLT at this point. -- Note, though, that I never really said it was compatible, it just builds on Python's re module. I still think that's enough for a Python XML library.
If it's not compatible, I think it should be invoked differently than in the EXSLT way. This way someone dropping in an EXSLT stylesheet with regexes doesn't have a half-working stylesheet but a completely and clearly failing stylesheet: lxml doesn't support the regexes. In addition, the path forward to getting the stylesheet working is clear: use the Python-based and deliberately incompatible regex facility instead, and rewrite the regexes.
Hmmm, I feel invited to disagree here. I reread the EXSLT spec on this topic and it does not contain any RE syntax specification and is rather unclear about what is required for compliance. It says this in the introduction of the RE module:

"""
For ease of implementation, the regular expressions used in this module currently use the Javascript regular expression syntax.
"""

while in the description of the functions, it mainly uses this wording:

"""
The second argument is a regular expression that follows the Javascript regular expression syntax.
"""

So, the way I read it, the "currently" does not seem to indicate a clear obligation to obey the actual RE syntax used in the spec. Especially the "ease of implementation" calls for a Python 're' implementation in lxml. :)

I also believe that people using XML in a Python environment would rather expect regular expressions to be compatible with what they know from Python's re module (where they are pretty well defined) than with JavaScript expressions. So far, the differences only seem to show for repeated groups, so a large area of use cases is even compatible.

BTW, the use case given in the EXSLT spec is easily rewritten by moving the RE repeat operator (+/*) into the group, so if portability is really required in this specific case, it can be achieved on the user side.

Stefan
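To make the repeated-group difference and the suggested rewrite concrete, here is a small sketch with Python's re module. The input string "test " and the pattern come from the discussion above; the variable names are only illustrative.

```python
import re

text = 'test '

# Repeat operator outside the group: the group is overwritten on each
# repetition, so only the last matched character is reported.
repeat_outside = re.findall(r'([a-z])+ ', text)

# Repeat operator moved inside the group: the group spans the whole word.
repeat_inside = re.findall(r'([a-z]+) ', text)

print(repeat_outside)  # ['t']
print(repeat_inside)   # ['test']
```

With the operator inside the group, the captured text no longer depends on how an engine reports a group that repeats, which is why the rewrite should be portable for this case.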