null bytes in re pattern - difference between 1.5.2 and 2.0?
Tim Peters
tim.one at home.com
Wed Dec 13 22:32:36 EST 2000
[posted and mailed]
[Skip Montanaro]
> I want to delete control characters from some strings. Accordingly, I
> tried:
>
> name = re.sub("[\000-\037\177]", "", name)
>
> This works in Python 2.0 but not in 1.5.2. In 1.5.2 I find I need
> to use raw strings:
That's a good idea in 2.0 too, y'know.
> name = re.sub(r"[\000-\037\177]", "", name)
>
> Accordingly, using raw strings in 2.0 fails.
Eh? Prove it. That is, submit a bug report with a specific failing example
if that's true. Works for me:
Python 2.0 (#8, Oct 16 2000, 17:27:58) [MSC 32 bit (Intel)] on win32
Type "copyright", "credits" or "license" for more information.
IDLE 0.6 -- press F1 for help
>>> import re
>>> allchars = [chr(i) for i in range(256)]
>>> fat = "".join(allchars)
>>> print len(fat)
256
>>> skinny = re.sub(r"[\000-\037\177]", "", fat)
>>> print len(skinny)
223
>>> 256 - 223
33
>>>
> Is there some form that will work both in 1.5.2 and 2.0?
The r-string form. Or use the optional deletechars argument to
string.translate, which should run much faster in either version.
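
A rough sketch of the translate route, for illustration only. It assumes
Python 3, where string.translate no longer exists and
bytes.translate(None, delete) is the closest analogue of the deletechars
argument. The names CONTROL_BYTES and strip_control are made up for this
example and do not come from the thread.

```python
# Assumption: Python 3, where bytes.translate(table, delete) plays the role
# of string.translate(s, table, deletechars) from 1.5.2/2.0.
# CONTROL_BYTES and strip_control are hypothetical names for this sketch.

# Bytes 0 through 31 (octal 000-037) plus DEL (octal 177).
CONTROL_BYTES = bytes(range(0o40)) + b"\x7f"


def strip_control(data):
    """Return data (a bytes object) with the control bytes deleted."""
    # A table of None means "no remapping, only delete".
    return data.translate(None, CONTROL_BYTES)


fat = bytes(range(256))
skinny = strip_control(fat)
print(len(fat), len(skinny))  # 256 223
```

No regexp is compiled at all here, so the question of how the pattern is
spelled never comes up.
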
> Is this a change I should have expected?
No.
> I assume it has something to do with Unicode support in 2.0.
Much more mundane than that: 1.5.2 used a 3rd-party regexp engine (PCRE),
and its interface required passing in the pattern as a regular old C string.
A C string ends at its first null byte, so you couldn't pass a pattern with
a literal null byte in 1.5.2.
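
A small sketch of what each spelling hands to the regexp engine (Python 3
syntax is an assumption here, and the variable names are illustrative
only). With the plain literal, Python has already converted the octal
escapes to characters before re sees the pattern; with the r-string, the
engine receives the backslashes and digits and decodes them itself.

```python
import re

cooked = "[\000-\037\177]"   # Python turns the octal escapes into characters
raw = r"[\000-\037\177]"     # backslashes and digits reach the engine intact

print(len(cooked), "\0" in cooked)  # 6 True
print(len(raw), "\0" in raw)        # 15 False

fat = "".join(chr(i) for i in range(256))
print(len(re.sub(raw, "", fat)))    # 223
```
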
ghosts-chasing-ghosts-ly y'rs - tim