raw strings

Sat Oct 12 01:05:14 EDT 2002

On 11 Oct 2002 06:28:52 -0700, mis6 at pitt.edu (Michele Simionato) wrote:

>>... So I'm
>> curious what use you have in mind.
>> 
>> Regards,
>> Bengt Richter
>
>
>For nearly all purposes, Python is the language which fits my mind the best.
>However, there is an exception: regular expressions.
>
>Python regular expressions are UGLY. 
>
>Suppose for instance I want to substitute regexp1 with regexp2 in a text: 
>in sed or perl I would give a command like
>
>s/regexp1/regexp2/
>
>In Python I must write 
>
>import re
>re.compile(r'regexp1').sub(r'regexp2',text)
>
>which to my mind is ugly. Therefore I wanted to define a nicer substitution
>command like this:
>
>def sub(regexp1,regexp2,text):
>    import re
>    r1=raw_string(regexp1)
>    r2=raw_string(regexp2)
>    return re.compile(r1).sub(r2,text)
>
>For this to work I need a raw_string function such that
>
>raw_string('regexp')==r'regexp' 
>
Actually, it could be made to work just the way you defined sub above,
but not with the raw_string you're thinking of ;-)
It is easier, though, if we cut out a few steps and write something like the
code below. (BTW, sub returns a test string early, so the normal return isn't
reached. I haven't tested the actual substitution, but it should work as if the
strings passed had r prefixes. Famous last words ;-) No guarantees. And use of the
enclosed code is emphatically advised AGAINST, except as a learning exercise ;-)
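
As background for why an ordinary raw_string function can't do the job, here is
a minimal sketch (written for a current Python, not the 2002 interpreter; the
raw_string below is just a stand-in for the name used in the question). Escapes
are resolved when the literal is compiled, so the function only ever sees the
finished value:

```python
# Escapes are resolved when the literal is compiled, before any call happens.
plain = '\1'     # one character: chr(1)
raw = r'\1'      # two characters: a backslash and the digit 1

def raw_string(s):
    # Stand-in for the hypothetical function from the question.
    # It receives the finished value and cannot see how it was spelled.
    return s

assert len(plain) == 1 and plain == chr(1)
assert len(raw) == 2 and raw == '\\' + '1'
assert raw_string('\1') != r'\1'   # the backslash is already gone
print(repr(plain), repr(raw))
```

So anything that wants the original spelling has to go back to the source text.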

----< simoniato.py >----------------------
# simoniato.py
def sub(regexp1,regexp2,text):
    import re
    from rawparam import rawparam
    r1=rawparam(0)
    r2=rawparam(1)
    return '%s %s %s' % (r1, r2, `text`) #XXX for test
    r1 = eval(r1)   # pass on the value to re as if you had written r'...'
    r2 = eval(r2)   # ... so we could do it more efficiently
    return re.compile(r1).sub(r2,text)

def test():
    print sub('hel\\lo', 'hel\lo', 'hel\\lo hel\lo')
    print sub('\1','\x01', '\1 \x01')
    print sub('\01','\001', '\01 \001')
    print sub(chr(1), '??', 'chr(1) ??')

if __name__ == '__main__':
    test()
------------------------------------------

Which prints this if we run it:
[21:43] C:\pywk\ng>python simoniato.py
r'hel\\lo' r'hel\lo' 'hel\\lo hel\\lo'
r'\1' r'\x01' '\x01 \x01'
r'\01' r'\001' '\x01 \x01'
Traceback (most recent call last):
  File "simoniato.py", line 19, in ?
    test()
  File "simoniato.py", line 16, in test
    print sub(chr(1), '??', 'chr(1) ??')
  File "simoniato.py", line 5, in sub
    r1=rawparam(0)
  File "rawparam.py", line 74, in rawparam
    raise ValueError, 'Literal string constant required, not %s.' % `param`
ValueError: Literal string constant required, not 'chr ( 1 )'.

But don't get your hopes up that there's a clean solution. The secret is
a real hack: it uses the inspect module to sneak a look back at the source code
of the call and grab the line containing it, then uses the tokenizer to pick
that line apart and find the function call whose name matches the routine
from which rawparam was called. The call must be on a single line, there must
be only one call to that routine on the line, and the parameter you're asking
for must be a string literal in the calling line (otherwise we couldn't get its
source ;-) Also, there must be a source file, so interactive calls won't fly
unless you import something that does the actual call.
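
The tokenizer half of that trick can be sketched in isolation. This is only an
illustration for a current Python, not the rawparam code itself: the helper name
string_literals_in_call is made up, the line is passed in as a string instead of
being fetched with inspect, and it assumes plain, unprefixed literals.

```python
import ast
import io
import tokenize

def string_literals_in_call(line, funcname):
    """Return the source text of the top-level string literals in funcname(...).

    Hypothetical helper: assumes one complete call to funcname on the line.
    """
    found = []
    inside = False          # True once we have seen "funcname ("
    prev_is_name = False    # previous token was NAME funcname
    depth = 0               # parenthesis depth inside the call
    for tok in tokenize.generate_tokens(io.StringIO(line).readline):
        if not inside:
            if prev_is_name and tok.string == '(':
                inside, depth = True, 1
            prev_is_name = (tok.type == tokenize.NAME and tok.string == funcname)
            continue
        if tok.type == tokenize.OP and tok.string == '(':
            depth += 1
        elif tok.type == tokenize.OP and tok.string == ')':
            depth -= 1
            if depth == 0:
                break       # end of the call we were looking for
        elif tok.type == tokenize.STRING and depth == 1:
            found.append(tok.string)   # the literal exactly as written
    return found

# The calling line as it would appear in a source file.
line = r"print(sub('a\\b', '\1', text))" + "\n"
literals = string_literals_in_call(line, 'sub')
raws = ['r' + lit for lit in literals]          # same spelling with an r prefix
values = [ast.literal_eval(r) for r in raws]    # the values re would receive
print(raws)
```

The STRING tokens carry the literal as it was typed, which is what lets the
r prefix be bolted on after the fact.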

----< rawparam.py >-----------------------
# rawparam.py 
# get the caller's nth string-literal argument as source text, as if written r'...'
def rawparam(n):
    import re
    import inspect, tokenize, token, StringIO
    cframe = inspect.currentframe().f_back
    cname = cframe.f_code.co_name
    ccframe = cframe.f_back
    ccinfo = inspect.getframeinfo(ccframe)
    callingline = ccinfo[3][0]
    tnt = [(token.tok_name[t[0]],t[1]) for t in
             tokenize.generate_tokens(StringIO.StringIO(callingline).readline)]
    #print callingline,'\n',tnt
    gotcall = 0     # 1==got cname, 2== got cname(
    paramtoks=[]
    bad=[]; parens = 0
    while tnt:
        toknam, tokval = tnt.pop(0)
        #print gotcall, toknam, tokval, paramtoks, bad, parens
        if gotcall == 0:
            if  toknam=='NAME' and tokval==cname:
                gotcall = 1
                continue
        if gotcall == 1:  # looking for '('
            if  toknam=='OP' and tokval=='(':
                parens = 1
                gotcall = 2
            else:
                gotcall = 0   #not the call we're looking for
                continue
        elif gotcall==2:    # looking for simple string arg
            if toknam=='STRING':
                paramtoks.append(tokval)
                gotcall = 3 # look for immediate comma or rparen
                continue
            elif toknam=='OP' and tokval=='(':
                parens = 1
            bad.append(tokval)
            gotcall=4       # accum bad until , or )
        elif gotcall==3:    # looking for OP [,)] after string token
            if toknam=='OP':
                if tokval==')': parens -= 1
                elif tokval=='(':
                    parens += 1
                if tokval==',':
                    gotcall=2   # good, get another
                    continue
                if not parens and tokval==')': break # good ending
            if paramtoks: bad.append(paramtoks.pop())   # string not by itself
            bad.append(tokval)
            gotcall=4       # accum bad until , or )
        elif gotcall==4:     # accum bad until , or )
            if toknam =='OP':
                if tokval==')': parens -= 1
                elif tokval=='(':
                    parens += 1
                if  parens <=1 and (tokval==',' or tokval==')'):
                    bad.append(tokval)
                    paramtoks.append(' '+' '.join(bad))
                    bad = []
                    gotcall=2   # try for normal arg
                    if not parens and tokval==')': break
                    continue
            bad.append(tokval)
            # fall through to continue
    #print paramtoks, bad
    if len(paramtoks)>n:
        param = paramtoks[n]
    else:
        raise ValueError, 'Not enough parameters in call: %s' % callingline.strip()
    if param[0]!='"' and param[0]!="'":
        if param[0]=='r' and param[1] in ['"',"'"]: return param # already raw string
        if param[0]==' ': param=param[1:]   #strip our error marker
        raise ValueError, 'Literal string constant required, not %s.' % `param`
    return 'r'+param    # make raw

# test
def raw_string(s):
    return rawparam(0)

def test():
    parametername = 'parametervalue' # make ready for #7
    for testno in range(9):
        try:
            if   testno==0: print raw_string('hel\\lo')
            elif testno==1: print raw_string('hel\lo')
            elif testno==2: print raw_string('hel\\lo')
            elif testno==3: print raw_string('\1')
            elif testno==4: print raw_string("""hel\\lo hel\lo \1 ' '' \` \" '''""")
            elif testno==5: print raw_string(r'already raw')
            elif testno==6: print raw_string('no expressions'*2)
            elif testno==7: print raw_string(parametername)
            elif testno==8: print raw_string(chr(1)*2)
        except ValueError,e:
            print 'ValueError: %s' % e

if __name__ == '__main__':
    test()
------------------------------------------
Which prints this if we run it:

[21:43] C:\pywk\ng>python rawparam.py
r'hel\\lo'
r'hel\lo'
r'hel\\lo'
r'\1'
r"""hel\\lo hel\lo \1 ' '' \` \" '''"""
r'already raw'
ValueError: Literal string constant required, not "'no expressions' * 2 )".
ValueError: Literal string constant required, not 'parametername )'.
ValueError: Literal string constant required, not 'chr ( 1 )'.

Now don't go using this for anything real ;-)

>Of course I could define the simpler
>
>def sub(r1,r2,text):
>    import re
>    return re.compile(r1).sub(r2,text)
>
>to be invoked as 
>
>sub(r'regexp1',r'regexp2',text)
>
>but to me the r in front of the string is ugly. This of course is only
>an aesthetic issue, but I don't like to be forced to put the r in front
>of raw strings, in the same way as I don't like languages where you
>are forced to put additional symbols in the names of variables to specify 
>their type. This is the reason why I would be happy to have a built-in 
>raw_string function available.
>Is there somebody else who thinks like me ?
>
Well, I am not completely happy with the string syntax, raw or not. I would like an option
to choose my own delimiter dynamically, and have raw ignore *everything* except a closing delimiter.
It's been discussed before, but the fact that you can't quote arbitrary text unmodified
doesn't seem to bother anyone, since triple quoting covers 99+% of cases.

This was just a kind of interesting exercise. You still have to write a legal string
in the call (e.g., you can't end a string like this: 'no no\' )

We could get around that by moving the text over into a comment on the calling line.
Then we could retrieve arbitrarily formatted strings and really make some people aghast ;-)
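
A rough sketch of the comment idea, again for a current Python and with the
calling line passed in as a string rather than fetched through inspect (the
name comment_text is made up for illustration):

```python
import io
import tokenize

def comment_text(line):
    """Return the text after '#' on a source line, or None if there is none."""
    for tok in tokenize.generate_tokens(io.StringIO(line).readline):
        if tok.type == tokenize.COMMENT:
            return tok.string[1:].strip()   # drop the '#' and outer blanks
    return None

# A comment may end in a backslash, which a string literal cannot.
line = "pat = from_comment()  # no no\\\n"
text = comment_text(line)
assert text == "no no\\"
print(text)
```

Note that leading and trailing blanks are lost here, so it is still not a
fully general quoting mechanism.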

But now that we have a proof of principle, can we just use the r prefix ;-)
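
For comparison, a minimal example of the plain r-prefix route, using the
module-level re.sub so no explicit compile step is needed (the sample text and
pattern are made up):

```python
import re

text = 'spam 2002 eggs'
# The r prefix lets \d and \1 reach the re module with backslashes intact.
result = re.sub(r'(\d+)', r'<\1>', text)
assert result == 'spam <2002> eggs'
print(result)
```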

Regards,
Bengt Richter