On Wed, 17 Aug 2022 19:23:02 +0100 MRAB <python@mrabarnett.plus.com> wrote:
I do not like introducing escapes which are not supported in other RE implementations. There is a chance of future conflicts.
Java broke compatibility in Java 8 by redefining \v from a single vertical tab character to the vertical whitespace class. I am not sure that it is a good example for us to follow, because different semantics for \v in raw and non-raw strings are a potential source of bugs. But with a special flag that controls the meaning of \v it may be safer.
Horizontal whitespace can be matched by [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000] in re or [\t\p{Zs}] in regex. Vertical whitespace can be matched by [\n\x0b\f\r\x85\u2028\u2029]. Note that there is a dedicated Unicode category for horizontal whitespace (excluding the tab itself), but not for vertical whitespace, which suggests that vertical whitespace is less important.
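For illustration, the two character classes above can be tried with the standard re module roughly as follows. This is only a sketch: the names HORIZONTAL, VERTICAL, split_lines and collapse_horizontal are made up for this example and are not part of re or regex, and treating \r\n as a single line break is an extra assumption not discussed above.

```python
import re

# Character classes copied from the message above (names are hypothetical).
HORIZONTAL = r"[ \t\xa0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]"
VERTICAL = r"[\n\x0b\f\r\x85\u2028\u2029]"

horiz_re = re.compile(HORIZONTAL)
vert_re = re.compile(VERTICAL)


def split_lines(text):
    """Split on vertical whitespace; \\r\\n counts as one break (an assumption)."""
    return re.split(r"\r\n|" + VERTICAL, text)


def collapse_horizontal(text):
    """Replace each run of horizontal whitespace with a single space."""
    return re.sub(HORIZONTAL + "+", " ", text)
```

With these definitions, split_lines breaks on a LINE SEPARATOR (U+2028) as well as on \n, while collapse_horizontal leaves line breaks alone and only squeezes runs such as a space followed by a tab or an IDEOGRAPHIC SPACE (U+3000).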
In any case, it is simple to introduce special Unicode categories and use \p{ht} and \p{vt} for horizontal and vertical whitespace.
It's not just Java. Perl supports all 4 of \h, \H, \v and \V. That might be why Java 8 changed. I've found that Perl has \p{HorizSpace} and \p{VertSpace}, so I'm going with that.
+1 for special Unicode categories rather than retargeting existing escapes for something else. (Also, matching horizontal/vertical whitespace sounds rather unusual.)

Regards

Antoine.