[Python-3000] bytes regular expression?

Victor Stinner victor.stinner at haypocalc.com
Thu Aug 9 17:40:58 CEST 2007


Hi,

On Thursday 09 August 2007 06:07:12 Guido van Rossum wrote:
> A quick temporary hack is to use buffer(b'abc') instead. (buffer() is
> so incredibly broken that it lets you hash() even if the underlying
> object is broken. :-)

I prefer str8, which looks like a good candidate for a "frozenbytes" type.

> The correct solution is to fix the re library to avoid using hash()
> directly on the underlying data type altogether; that never had sound
> semantics (as proven by the buffer() hack above).

The re module uses a dictionary to cache compiled expressions. The key is a 
tuple (pattern, flags), where pattern is a bytes (str8) or str object and 
flags is an int.

re module bugs:
 1. _compile() doesn't support bytes
 2. escape() doesn't support bytes

My attached patch fixes both bugs:
 - convert bytes to str8 in _compile() to be able to hash it
 - add a special version of escape() for bytes

I don't know the best way to build a bytes object in a for loop. In Python 
2.x, the best method is to collect the pieces in a list and call ''.join(). 
Since bytes is mutable, I chose to use append() and concatenation (a += b).
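
To illustrate the two approaches, here is a rough sketch of a bytes escape
function in current Python 3, using bytearray as the mutable type. It is not
the code from the attached patch; escape_bytes, escape_bytes_join and
_ALPHANUM_BYTES are made-up names, and the escaping rule (backslash before
every non-alphanumeric byte, NUL written as \000) is an assumption modelled on
the classic re.escape() behaviour:

```python
# Byte values (ints) that are left unescaped; hypothetical helper name.
_ALPHANUM_BYTES = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
)

def escape_bytes(pattern):
    """Build the result in place with append() and +=."""
    out = bytearray()
    for byte in pattern:              # iterating over bytes yields ints
        if byte in _ALPHANUM_BYTES:
            out.append(byte)
        elif byte == 0:
            out += b"\\000"           # NUL is written as an octal escape
        else:
            out += b"\\"
            out.append(byte)
    return bytes(out)

def escape_bytes_join(pattern):
    """Same result, built with the list + join idiom."""
    parts = []
    for byte in pattern:
        if byte in _ALPHANUM_BYTES:
            parts.append(bytes([byte]))
        elif byte == 0:
            parts.append(b"\\000")
        else:
            parts.append(b"\\" + bytes([byte]))
    return b"".join(parts)
```

Both functions return the same bytes; the first one avoids creating a small
temporary object for every input byte.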

I also added a new unit test for the escape() function with a bytes argument.

You may not want to apply my patch directly: I don't know Python 3000 or the 
Python coding style very well. But my patch should help to fix the bugs ;-)

-----

Why does the re module still have code for Python < 2.2 (the optional 
finditer() function)? Since the code is now specific to Python 3000, we 
should use new types like set (use a set for _alphanum instead of a 
dictionary) and functions like enumerate (in _escape for the str block).
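
A possible shape for that cleanup, as a sketch only: escape_str is a
hypothetical name, and the escaping rule is assumed to be the same as in the
existing escape() (backslash before every non-alphanumeric character, NUL
written as \000).

```python
import string

# A set gives the same fast membership test as the old dictionary,
# without dummy values.
_alphanum = frozenset(string.ascii_letters + string.digits)

def escape_str(pattern):
    chars = list(pattern)
    # enumerate() replaces the range(len(...)) loop plus manual indexing.
    for i, c in enumerate(chars):
        if c not in _alphanum:
            chars[i] = "\\000" if c == "\0" else "\\" + c
    return "".join(chars)
```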

Victor Stinner
http://hachoir.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: py3k-struni-re.diff
Type: text/x-diff
Size: 3440 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070809/23e53b6a/attachment-0001.bin 

