Maybe allow br"" or rb"" e.g., for bytes regexes in Py3?

Hi,
Python 3 has two string prefixes r"" for raw strings and b"" for bytes.
So if you want to create a regex based on bytes, as far as I can tell, you have to do something like this:
FONTNAME_RE = re.compile(r"/FontName\s+/(\S+)".encode("ascii")) # or FONTNAME_RE = re.compile(b"/FontName\\s+/(\\S+)")
I think it would be much nicer if one could write:
FONTNAME_RE = re.compile(br"/FontName\s+/(\S+)") # or FONTNAME_RE = re.compile(rb"/FontName\s+/(\S+)")
I _slightly_ prefer rb"" to br"" but either would be great:-)
Why would you want a bytes regex?
In my case I am reading PostScript files and PostScript .pfa font files so that I can embed the latter into the former. But I don't know what encoding these files use beyond the fact that it is ASCII or some ASCII superset like Latin1. So in true Python style I don't assume: instead I read the files as bytes and do all my processing using bytes, at no point decoding since I only ever insert ASCII characters. I don't think this is a rare example: with Python 3's clean separation between strings & bytes (a major advance IMO), I think there will often be cases where all the processing is done using bytes.
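For readers who want to try the comparison, here is a minimal sketch of the three spellings from this message applied to bytes data. The sample line and the font name in it are made up for illustration; they are not taken from a real .pfa file.

```python
import re

# Three spellings of the same bytes pattern, as discussed in the message.
encoded = r"/FontName\s+/(\S+)".encode("ascii")
escaped = b"/FontName\\s+/(\\S+)"
raw_bytes = br"/FontName\s+/(\S+)"

FONTNAME_RE = re.compile(raw_bytes)

# Hypothetical fragment of a font file, held as bytes and never decoded.
sample = b"/FontName /ExampleSans-Bold def\n"

match = FONTNAME_RE.search(sample)
font_name = match.group(1) if match else None
print(font_name)
```

A bytes pattern can only be matched against bytes input, so the captured group is itself a bytes object and the whole pipeline stays free of any decoding step.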

On Tue, Jun 29, 2010 at 6:20 PM, Mark Summerfield <mark@qtrac.eu> wrote:
FONTNAME_RE = re.compile(br"/FontName\s+/(\S+)") # or FONTNAME_RE = re.compile(rb"/FontName\s+/(\S+)")
I _slightly_ prefer rb"" to br"" but either would be great:-)
According to my local build, we already picked 'br':
Python 3.2a0 (py3k:81943, Jun 12 2010, 22:02:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> "\t"
'\t'
>>> r"\t"
'\\t'
>>> b"\t"
b'\t'
>>> br"\t"
b'\\t'
I installed the system python3 to confirm that this isn't new:
Python 3.1.2 (r312:79147, Apr 15 2010, 15:35:48)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> br"\t"
b'\\t'
Cheers, Nick.
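The interpreter sessions above can also be condensed into a small script. This is only a sketch: it uses the length of each literal to show whether "\t" was cooked into a single tab or kept as a backslash followed by "t". The dictionary keys are labels chosen for this example.

```python
# One entry per prefix combination shown in the sessions above.
literals = {
    "str": "\t",
    "raw str": r"\t",
    "bytes": b"\t",
    "raw bytes": br"\t",
}

# A length of 1 means the escape was processed; 2 means it was left alone.
for name, value in literals.items():
    print(name, repr(value), len(value))
```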

You're right, so I've raised it as a doc bug: http://bugs.python.org/issue9114

On Tue, Jun 29, 2010 at 6:07 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Nick Coghlan wrote:
According to my local build, we already picked 'br':
Wouldn't "raw bytes" sound better than "bytes raw"? Or do the Dutch say it differently? :-)
I can pronounce "brrrrr" but I can't say "rrrrrb". :-)

Guido van Rossum wrote:
I can pronounce "brrrrr" but I can't say "rrrrrb". :-)
And, of course, Python 2 has 'ur', but not 'ru'.

On 6/29/2010 10:04 PM, MRAB wrote:
And, of course, Python 2 has 'ur', but not 'ru'.
Even though most say or think 'raw unicode' rather than 'unicode raw'. But ur and br strike me as logically correct. In both Py2 and Py3, string literals are str literals. The r prefix disables most of the cooking of the literal. The u and b prefixes are effectively abbreviations for unicode() and bytes() calls on, I presume, the buffer part of a partially formed str object. In other words, br'abc' has the same effect as bytes(r'abc', 'ascii') but is easier to write and, I presume, faster to compute.
It is easy for people who only use ascii chars in Python code to forget that Python3 code is now actually a sequence of unicode chars rather than of (extended) ascii chars.
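The equivalence described above can be checked directly. One detail worth knowing: in Python 3, bytes() applied to a str needs an explicit encoding, so this sketch passes "ascii", an assumption that only fits literals made of ASCII characters.

```python
# A raw bytes literal versus encoding the corresponding raw str.
raw_bytes = br"a\rc"
encoded = bytes(r"a\rc", "ascii")
same = raw_bytes == encoded

# Calling bytes() on a str without an encoding is rejected in Python 3.
try:
    bytes(r"a\rc")
    needs_encoding = False
except TypeError:
    needs_encoding = True

print(same, needs_encoding)
```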

Mark Summerfield writes:
Python 3 has two string prefixes r"" for raw strings and b"" for bytes.
And you *can* combine them, but it needs to be in the right order (although I'm not sure that's intentional):
steve@uwakimon ~ $ python3.1
Python 3.1.2 (release31-maint, May 12 2010, 20:15:06)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> rb"a\rc"
  File "<stdin>", line 1
    rb"a\rc"
           ^
SyntaxError: invalid syntax
>>> br"abc"
b'abc'
>>> br"a\rc"
b'a\\rc'
Watch out for that time machine!
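The same check can be scripted so that it runs on any interpreter instead of stopping at the SyntaxError. The helper name prefix_accepted is made up for this sketch. On the 3.1 interpreter shown above only "br" is accepted; Python 3.3 and later also accept "rb" as a synonym.

```python
import sys

def prefix_accepted(prefix):
    """Return True if this interpreter accepts the prefix on a string literal."""
    source = prefix + r'"a\rc"'
    try:
        # Compiling the source text keeps the SyntaxError catchable at run time.
        compile(source, "<prefix-check>", "eval")
    except SyntaxError:
        return False
    return True

results = {prefix: prefix_accepted(prefix) for prefix in ("br", "rb")}
print(sys.version_info[:2], results)
```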
participants (7)
- Greg Ewing
- Guido van Rossum
- Mark Summerfield
- MRAB
- Nick Coghlan
- Stephen J. Turnbull
- Terry Reedy