[issue25743] Clarify exactly what \w matches in UNICODE mode
New submission from Zack Weinberg: The `re` module documentation does not do a good job of explaining exactly what `\w` matches. Quoting https://docs.python.org/3.5/library/re.html :
\w For Unicode (str) patterns: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore.
Empirically, this appears to mean "everything in Unicode general categories L* and N*, plus U+005F (underscore)". That is a perfectly sensible definition and the documentation should state it in those terms. "Unicode word characters" could mean any number of different things; note for instance that UTS#18 gives a very different definition.
(Further reading: https://gist.github.com/zackw/3077f387591376c7bf67 plus links therefrom).
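The empirical claim above can be checked directly. What follows is a minimal sketch, not the procedure used in the gist. It assumes Python 3, where str patterns use Unicode matching by default, and the helper names `predicted_word_char` and `find_mismatches` are made up for illustration. It lists every code point where `\w` disagrees with the "L*, N*, or underscore" rule. The result depends on the Unicode database version bundled with the interpreter, so no particular output is promised.

```python
import re
import sys
import unicodedata

# str pattern: in Python 3 this uses Unicode matching unless re.ASCII is given.
WORD = re.compile(r"\w")


def predicted_word_char(ch):
    """The rule proposed in this issue: general category L* or N*, or U+005F."""
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")


def find_mismatches(limit=sys.maxunicode + 1):
    """Return (code point, general category) pairs where \\w and the rule disagree."""
    mismatches = []
    for cp in range(limit):
        ch = chr(cp)
        if bool(WORD.fullmatch(ch)) != predicted_word_char(ch):
            mismatches.append((cp, unicodedata.category(ch)))
    return mismatches


if __name__ == "__main__":
    # Scan the whole code space and report how many disagreements were found.
    found = find_mismatches()
    print(len(found), "mismatches")
    for cp, cat in found[:20]:
        print("U+%04X" % cp, cat)
```

Any code points it prints are candidates for exceptions to the rule on that interpreter, and the category column shows which class they fall in.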
----------
assignee: docs@python
components: Documentation
messages: 255463
nosy: docs@python, zwol
priority: normal
severity: normal
status: open
title: Clarify exactly what \w matches in UNICODE mode
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6
_______________________________________
Python tracker
Andi McClure added the comment:
I would also like to request that a clear explanation be given in the documentation for the 2.7 branch. From https://docs.python.org/2.7/library/re.html :
"\w ... If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database"
This is ambiguous. Does it mean the "Alphabetic" property from UAX#44? Does it mean something else?
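One way to probe this empirically is to compare `\w` against `str.isalnum()`. The sketch below assumes Python 3 (not the 2.7 branch quoted above) and tests the hypothesis, which the quoted documentation does not confirm, that `\w` matches exactly the characters for which `str.isalnum()` is true, plus the underscore. The helper name `agrees_with_isalnum` and the sample list are illustrative only. As far as I know the standard `unicodedata` module does not expose the UAX#44 "Alphabetic" property, so that comparison is not attempted here.

```python
import re

WORD = re.compile(r"\w")


def agrees_with_isalnum(ch):
    """True if \\w and (str.isalnum() or underscore) give the same answer for ch."""
    return bool(WORD.fullmatch(ch)) == (ch.isalnum() or ch == "_")


# A few hand-picked samples: ASCII, accented Latin, CJK, an Arabic-Indic digit,
# a vulgar fraction, a Roman numeral (category Nl), a combining mark, punctuation.
samples = ["a", "Z", "7", "_", "\u00e9", "\u4e2d", "\u0661",
           "\u00bd", "\u2160", "\u0301", " ", "-"]

disagreements = [ch for ch in samples if not agrees_with_isalnum(ch)]
print(disagreements)
```

A small sample like this cannot prove the hypothesis; extending the loop to every code point would give a stronger check.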
----------
nosy: +Andi McClure
Zack Weinberg added the comment:
FWIW, the actual behavior of \w matching "everything in Unicode general categories L* and N*, plus U+005F (underscore)" is consistent across all versions I can conveniently test (2.7, 3.4, 3.5).
In 2.7, there are four characters in general category Nl that \w doesn't match, but I believe that is just a bug, not an intentional difference of behavior.
----------
Changes by Ezio Melotti
participants (3): Andi McClure, Ezio Melotti, Zack Weinberg