Whatever I may have said before, I favor supporting the Unicode standard for \w, which is related to the standard for identifiers.

This is one of 2 issues about \w being defined too narrowly.  I am somewhat arbitrarily closing #1693050 as a duplicate of this (fewer digits ;-).

There are 3 issues about tokenize.tokenize failing on valid identifiers, defined as \w sequences whose first char is itself an identifier (and therefore a start char).  In msg313814 of #32987, Serhiy indicates which identifier start and continue characters are matched by \W (that is, not matched by \w) for re and regex.  I am leaving #24194 open as the tokenizer name issue.
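
For anyone who wants to see the mismatch directly, here is a rough sketch (not taken from any of the linked issues) that lists characters str.isidentifier accepts but re's \w does not match.  The function name and the BMP-only default limit are my own choices for illustration, and the exact counts will depend on the Unicode version of the Python build.

```python
import re

WORD = re.compile(r"\w")

def identifier_chars_missed_by_w(limit=0x10000):
    # Hypothetical helper: scan code points below `limit` and collect
    # those that are legal in identifiers but are not matched by \w.
    start, cont = [], []
    for cp in range(limit):
        ch = chr(cp)
        if WORD.fullmatch(ch):
            continue  # \w already covers this character
        if ch.isidentifier():
            start.append(ch)   # legal as the first char of a name
        elif ("a" + ch).isidentifier():
            cont.append(ch)    # legal only after a start char
    return start, cont

start, cont = identifier_chars_missed_by_w()
print(len(start), "start chars and", len(cont), "continue chars missed by \\w")

# One concrete case: a letter followed by a combining acute accent is a
# valid identifier, but \w+ does not match the whole string.
name = "a\u0301"
print(name.isidentifier(), WORD.pattern, re.fullmatch(r"\w+", name))
```

The combining marks in the continue list are the kind of character that makes a \w-based name pattern reject identifiers the compiler accepts.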

