python3 raw strings and \u escapes
wxjmfauth at
Thu May 31 01:43:10 EDT 2012
On 30 May, 08:52, "ru... at" <ru... at> wrote:
> In python2, "\u" escapes are processed in raw unicode
> strings. That is, ur'\u3000' is a string of length 1
> consisting of the IDEOGRAPHIC SPACE unicode character.
> In python3, "\u" escapes are not processed in raw strings.
> r'\u3000' is a string of length 6 consisting of a backslash,
> 'u', '3' and three '0' characters.
> This breaks a lot of my code because in python 2
> re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
> but in python 3 (the result of running 2to3),
> re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']
> I can remove the "r" prefix from the regex string but then
> if I have other regex backslash symbols in it, I have to
> double all the other backslashes -- the very thing that
> the r-prefix was invented to avoid.
> Or I can leave the "r" prefix and replace something like
> r'[ \u3000]' with r'[ ]'. But that is confusing because
> one can't distinguish between the space character and
> the ideographic space character. It is also a problem if a
> reader of the code doesn't have a font that can display
> the character.
> Was there a reason for dropping the lexical processing of
> \u escapes in strings in python3 (other than to add another
> annoyance in a long list of python3 annoyances?)
> And is there no choice for me but to choose between the two
> poor choices I mention above to deal with this problem?
I suggest approaching the problem differently. Python 3
succeeded in putting order into the mismatch of "character
encodings" that Python 2 offered.
"Characters" (in fact Unicode code points) are just (normal)
"characters". The backslash, used as an escape character,
keeps its function.
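
A minimal sketch of that point, assuming a UTF-8 source file (the variable names are only illustrative): in a normal string the escape and the literal character are the same code point, while the r prefix leaves the backslash untouched.

```python
# In a normal Python 3 string, \uXXXX is resolved by the compiler,
# so the escape and the literal character are the same code point.
ideographic_space = '\u3000'
same_char = ('é' == '\u00e9')

# With the r prefix the backslash is kept as-is: six separate characters.
raw = r'\u3000'

print(len(ideographic_space), same_char, len(raw), raw == '\\u3000')
```
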
Note the absence of the r'...' prefix in the following session:
>>> s = 'a\u3000é\u3000€'
>>> s.split('\u3000')
['a', 'é', '€']
>>> import re
>>> re.split('\u3000', s)
['a', 'é', '€']
>>> s = 'a\\b\\c'
>>> print(s)
a\b\c
>>> s.split('\\')
['a', 'b', 'c']
>>> re.split('\\\\', s)
['a', 'b', 'c']
>>> hex(ord('\\'))
'0x5c'
>>> re.split('\u005c\u005c', s)
['a', 'b', 'c']
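
For patterns that mix regex backslashes with \u escapes, here are two possible ways out, as a sketch rather than a definitive recipe. The first relies on Python joining adjacent string literals, so only the regex part carries the r prefix. The second assumes Python 3.3 or later, where the re module itself understands \uXXXX in a pattern. The variable names are hypothetical.

```python
import re

# Option 1: adjacent literals are joined by the compiler, so the regex
# part stays raw and the \u escape lives in a normal string.
pattern_mixed = r'\d' '\u3000'
parts_mixed = re.split(pattern_mixed, 'a1\u3000b')

# Option 2 (assumes Python 3.3+): the raw string keeps the six characters
# \u3000, and the re module interprets the escape itself.
pattern_raw = r'[\u3000]'
parts_raw = re.split(pattern_raw, 'A\u3000A')

print(parts_mixed, parts_raw)
```

With the first option the other backslashes need no doubling, and the ideographic space stays visible in the source as an escape.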