[Tutor] RE module is working ?

Fri Feb 4 20:07:24 CET 2011

On 02/04/2011 02:36 AM, Steven D'Aprano wrote:
> Karim wrote:
>
>>>> *Indeed what's the matter with RE module!?*
>>> You should really fix the problem with your email program first;
>> Thunderbird issue with bold type (appears as stars) but I don't know 
>> how to fix it yet.
>
> A man went to a doctor and said, "Doctor, every time I do this, it 
> hurts. What should I do?"
>
> The doctor replied, "Then stop doing that!"
>
> :)

Yes, these words made me laugh. I will keep them in my funny box.

>
>
> Don't add bold or any other formatting to things which should be 
> program code. Even if it looks okay in *your* program, you don't know 
> how it will look in other people's programs. If you need to draw 
> attention to something in a line of code, add a comment, or talk about 
> it in the surrounding text.
>
>
> [...]
>> That is not the thing I want. I want to escape any " which are not 
>> already escaped.
>> The sed regex  '/\([^\\]\)\?"/\1\\"/g' is exactly what I need (I have 
>> been writing regexes on Unix for 15 years).

Mainly sed, awk and perl, and sometimes grep and egrep. I know it is a 
jungle.

> Which regex? Perl regexes? sed or awk regexes? Extended regexes? GNU 
> posix compliant regexes? grep or egrep regexes? They're all different.
>
> In any case, I am sorry, I don't think your regex does what you say. 
> When I try it, it doesn't work for me.
>
> [steve at sylar ~]$ echo 'Some \"text"' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
> Some \\"text\"

I give you my word on this. Here is the exact output; I reran it:

#MY OS VERSION
karim at Requiem4Dream:~$ uname -a
Linux Requiem4Dream 2.6.32-28-generic #55-Ubuntu SMP Mon Jan 10 23:42:43 
UTC 2011 x86_64 GNU/Linux
#MY SED VERSION
karim at Requiem4Dream:~$ sed --version
GNU sed version 4.2.1
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE,
to the extent permitted by law.

GNU sed home page: <http://www.gnu.org/software/sed/>.
General help using GNU software: <http://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-gnu-utils at gnu.org>.
Be sure to include the word ``sed'' somewhere in the ``Subject:'' field.
#MY SED OUTPUT COMMAND:
karim at Requiem4Dream:~$ echo 'Some ""' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
Some \"\"
# THIS IS WHAT I WANT: WITH 2 CONSECUTIVE QUOTES, IF THE FIRST ONE IS ALREADY
# ESCAPED I DON'T WANT TO ESCAPE IT TWICE.
karim at Requiem4Dream:~$ echo 'Some \""' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
Some \"\"
# BY THE WAY THIS ONE WORKS:
karim at Requiem4Dream:~$ echo 'Some "text"' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
Some \"text\"
# BUT OF COURSE NOT THIS ONE, WHICH IS NOT COVERED BY MY REGEX (I KNOW IT AND
# ORIGINALLY WANTED TO COVER IT):
karim at Requiem4Dream:~$ echo 'Some \"text"' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
Some \\"text\"

By the way, in all the sed versions I work with, the '?' (zero or one 
match) must be escaped; that is why I write '\?'. The same goes for the 
grouping parentheses '\(' and '\)' that save the matched value. In perl 
or egrep you don't need to escape them.

# SAMPLE FROM http://www.gnu.org/software/sed/manual/sed.html

\+
    As *, but matches one or more. It is a GNU extension.
\?
    As *, but only matches zero or one. It is a GNU extension.

> I wouldn't expect it to work. See below.
>
> By the way, you don't need to escape the brackets or the question mark:
>
> [steve at sylar ~]$ echo 'Some \"text"' | sed -re 's/([^\\])?"/\1\\"/g'
> Some \\"text\"
>
>
>> For me the equivalent python regex is buggy: r'([^\\])?"', r'\1\\"'
>
> No it is not.
>

Yes, I know. See my latest post for details; I have already found the 
solution. Here it is again:

# Found the solution: the '?' needs to be inside the parentheses (the saved 
# pattern), because when it is outside we don't know whether the saved 
# match, namely '\1', will exist or not.

 >>> re.subn(r'([^\\]?)"', r'\1\\"', expression)

(' \\"\\" ', 2)
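
A minimal sketch of why moving the '?' inside the parentheses helps (the 
names optional_group and optional_inside are only for illustration): with 
the '?' outside, group 1 may not take part in the match at all; with it 
inside, group 1 always takes part and may simply be empty.

```python
import re

# '?' outside the parentheses: the whole group is optional.
optional_group = re.compile(r'([^\\])?"')
# '?' inside the parentheses: the group always participates, possibly empty.
optional_inside = re.compile(r'([^\\]?)"')

text = '"abc'  # a quote with nothing in front of it

print(repr(optional_group.match(text).group(1)))   # None
print(repr(optional_inside.match(text).group(1)))  # ''
```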

> The pattern you are matching does not do what you think it does. "Zero 
> or one of not-backslash, followed by a quote" will match a single 
> quote *regardless* of what is before it. This is true even in sed, as 
> you can see above, your sed regex matches both quotes.
>
> \" will match, because the regular expression will match zero 
> characters, followed by a quote. So the regex is correct.
>
> >>> match = r'[^\\]?"'  # zero or one not-backslash followed by quote
> >>> re.search(match, r'aaa\"aaa').group()
> '"'
>
> Now watch what happens when you call re.sub:
>
>
> >>> match = r'([^\\])?"'  # group 1 equals a single non-backslash
> >>> replace = r'\1\\"'  # group 1 followed by \ followed by "
> >>> re.sub(match, replace, 'aaaa')  # no matches
> 'aaaa'
> >>> re.sub(match, replace, 'aa"aa')  # one match
> 'aa\\"aa'
> >>> re.sub(match, replace, '"aaaa')  # one match, but there's no group 1
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/lib/python3.1/re.py", line 166, in sub
>     return _compile(pattern, flags).sub(repl, string, count)
>   File "/usr/local/lib/python3.1/re.py", line 303, in filter
>     return sre_parse.expand_template(template, match)
>   File "/usr/local/lib/python3.1/sre_parse.py", line 807, in expand_template
>     raise error("unmatched group")
> sre_constants.error: unmatched group
>
> Because group 1 was never matched, Python's re.sub raised an error. It 
> is not a very informative error, but it is valid behaviour.
>
> If I try the same thing in sed, I get something different:
>
> [steve at sylar ~]$ echo '"Some text' | sed -re 's/([^\\])?"/\1\\"/g'
> \"Some text
>
> It looks like this version of sed defines backreferences on the 
> right-hand side to be the empty string, in the case that they don't 
> match at all. But this is not standard behaviour. The sed FAQs say 
> that this behaviour will depend on the version of sed you are using:
>
> "Seds differ in how they treat invalid backreferences where no 
> corresponding group occurs."
>
> http://sed.sourceforge.net/sedfaq3.html
>
> So you can't rely on this feature. If it works for you, great, but it 
> may not work for other people.
>
>
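
One way to get the sed-like behaviour shown above in Python, whatever the 
version, might be to pass a function as the replacement and treat an 
unmatched group 1 as the empty string. This is only a sketch, and 
add_backslash is a made-up helper name. (According to the re documentation, 
from Python 3.5 on re.sub itself replaces unmatched groups with an empty 
string.)

```python
import re

pattern = re.compile(r'([^\\])?"')

def add_backslash(match):
    # group(1) is None when the optional group did not take part in the match
    prefix = match.group(1) or ''
    return prefix + '\\"'

print(pattern.sub(add_backslash, '"Some text'))  # \"Some text
```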
> When you delete the ? from the Python regex, group 1 is always valid, 
> and you don't get an exception. Or if you ensure the input always 
> matches group 1, no exception:
>
> >>> match = r'([^\\])?"'
> >>> replace = r'\1\\"'
> >>> re.sub(match, replace, 'a"a"a"a') # group 1 always matches
> 'a\\"a\\"a\\"a'
>
> (It still won't do what you want, but that's a *different* problem.)
>
>
>
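
For the record, a regex-only route to the original goal (escape a quote 
only when it is not already preceded by a backslash) could use a negative 
lookbehind rather than an optional group. A hedged sketch: escape_quotes is 
a hypothetical name, and it assumes a backslash right before a quote always 
means "already escaped", so an escaped backslash followed by a quote is not 
handled.

```python
import re

# A double quote that is not immediately preceded by a backslash.
UNESCAPED_QUOTE = re.compile(r'(?<!\\)"')

def escape_quotes(text):
    """Put a backslash before every quote that does not already have one."""
    return UNESCAPED_QUOTE.sub(r'\\"', text)

print(escape_quotes(r'Some \"text"'))  # Some \"text\"
print(escape_quotes('Some ""'))        # Some \"\"
```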
> Jamie Zawinski wrote:
>
>   Some people, when confronted with a problem, think "I know,
>   I'll use regular expressions." Now they have two problems.
>
> How many hours have you spent trying to solve this problem using 
> regexes? This is a *tiny* problem that requires an easy solution, not 
> wrestling with a programming language that looks like line-noise.
>
> This should do what you ask for:
>
> def escape(text):
>     """Escape any double-quote characters if and only if they
>     aren't already escaped."""
>     output = []
>     escaped = False
>     for c in text:
>         if c == '"' and not escaped:
>             output.append('\\')
>         elif c == '\\':
>             output.append('\\')
>             escaped = True
>             continue
>         output.append(c)
>         escaped = False
>     return ''.join(output)
>

Thank you for this one! This gives me some inspiration for other more 
complicated parsing. :-)

>
> Armed with this helper function, which took me two minutes to write, I 
> can do this:
>
> >>> text = 'Some text with backslash-quotes \\" and plain quotes " together.'
> >>> print(escape(text))
> Some text with backslash-quotes \" and plain quotes \" together.
>
>
> Most problems that people turn to regexes for are best solved without 
> regexes. Even Larry Wall, inventor of Perl, is dissatisfied with regex 
> culture and syntax:
>
> http://dev.perl.org/perl6/doc/design/apo/A05.html

OK, but if I had to give up my one-liner sed regexes, my most-used 
utilities, it would be like refusing to use my car to go to work
and walking 20 km on foot.
  I can understand the point about overuse, though I have already written 
a 30-line pure sed script using all its features,
which would have taken many more lines in awk or perl.

Anyway, I am leaning toward Python now, so since an re module exists that 
handles my small regexes, it is no big deal to become familiar with it.

Thanks for all your efforts.

Regards
Karim
