[regex] case-splitting strings in unicode
mde at micah.elliott.name
Sun Oct 9 03:25:16 CEST 2005
On Oct 09, John Perks and Sarah Mount wrote:
> I have to split some identifiers that are casedLikeThis into their
> component words. In this instance I can safely use [A-Z] to represent
> uppercase, but what pattern should I use if I wanted it to work more
> generally? I can envisage walking the string testing the
> unicodedata.category of each char, but is there a regex'y way to
> denote "uppercase"?
Not sure what your output should look like but something like this could
>>> import re
>>> re.sub(r'([A-Z])', r' \1', 'theFirstTest theSecondTest')
'the First Test the Second Test'
This can be adapted for multiline, etc, but maybe '[A-Z]' is
sufficiently general. The regex module does have an understanding of
unicode (but I don't, sorry); you could add (?u) make it unicode aware.
For programming language identifiers I wouldn't think that unicode
should be an issue. Sorry I'm no help with unicode specifics.
Some useful links:
<mde at micah.elliott.name>
More information about the Python-list