Regular Expression for Finding and Deleting comments

MRAB python at mrabarnett.plus.com
Tue Jan 4 14:54:34 EST 2011


On 04/01/2011 19:37, Jeremy wrote:
> On Tuesday, January 4, 2011 11:26:48 AM UTC-7, MRAB wrote:
>> On 04/01/2011 17:11, Jeremy wrote:
>>> I am trying to write a regular expression that finds and deletes (replaces with nothing) comments in a string/file.  A comment is either a line whose first non-whitespace character is a 'c', or everything from a dollar sign to the end of the line.  Replacing these comments with nothing isn't too hard.  The trouble is that the comments are replaced with a new-line, or rather the new-line isn't captured in the regular expression.
>>>
>>> Below, I have copied a minimal example.  Can someone help?
>>>
>>> Thanks,
>>> Jeremy
>>>
>>>
>>> import re
>>>
>>> text = """ c
>>> C - Second full line comment (first comment had no text)
>>> c   Third full line comment
>>>     F44:N 2    $ Inline comments start with dollar sign and go to end of line"""
>>>
>>> commentPattern = re.compile("""
>>>       (^\s*?c\s*?.*?|             # Comment start with c or C
>>>       \$.*?)$\n                           # Comment starting with $
>>>       """, re.VERBOSE|re.MULTILINE|re.IGNORECASE)
>>>
>> Part of the problem is that you're not using raw string literals or
>> doubling the backslashes.
>>
>> Try something like this:
>>
>> commentPattern = re.compile(r"""
>>       (^[ \t]*c.*\n|              # Comment start with c or C
>>       [ \t]*\$.*)                 # Comment starting with $
>>       """, re.VERBOSE|re.MULTILINE|re.IGNORECASE)
>
> Using a raw string literal fixed the problem for me.  Thanks for the suggestion.  Why is that so important?
>
Regexes often use escape sequences, but so do string literals, and a
sequence which is intended for the regex engine might not get passed
along correctly. For example, in a normal string literal \b means
'backspace' and will be passed to the regex engine as that; in a regex
it usually means 'word boundary':

     A regex for "the" as a word: \bthe\b

     As a raw string literal:     r"\bthe\b"

     As a normal string literal:  "\\bthe\\b"

     "\bthe\b" means:             backspace + "the" + backspace
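
As a rough sketch of the above, the difference can be checked directly in
Python. The sample strings and variable names here are made up for
illustration and are not from the thread. The second half is a likely
explanation, not a confirmed one, of why the new-line went missing in the
original pattern: in a normal string literal \n becomes a real new-line
character, which re.VERBOSE then ignores as whitespace, whereas in a raw
string literal the regex engine receives backslash + n and matches a
new-line.

```python
import re

# Hypothetical sample text, chosen only for this sketch.
sentence = "over the hill"

raw = r"\bthe\b"        # 7 characters: the backslashes reach the regex engine
doubled = "\\bthe\\b"   # the same 7 characters, written with doubled backslashes
normal = "\bthe\b"      # 5 characters: backspace + "the" + backspace

print(len(raw), len(doubled), len(normal))
print(raw == doubled)
print(re.search(raw, sentence) is not None)     # word-boundary match
print(re.search(normal, sentence) is not None)  # looks for literal backspaces

# The same effect with \n under re.VERBOSE (hypothetical two-line input).
lines = "c a comment\nkeep me"

# Normal literal: \n is a real new-line, which VERBOSE drops from the pattern,
# so the pattern is effectively just c.*
verbose_normal = re.compile("""c.*\n""", re.VERBOSE)

# Raw literal: the regex engine sees backslash + n and matches the new-line.
verbose_raw = re.compile(r"""c.*\n""", re.VERBOSE)

after_normal = verbose_normal.sub("", lines)
after_raw = verbose_raw.sub("", lines)
print(repr(after_normal))
print(repr(after_raw))
```

With the normal literal the comment text is removed but its new-line is left
behind; with the raw literal the whole line goes.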

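For completeness, here is a minimal sketch of the suggested pattern applied
to the sample text from the original question. The names comment_pattern and
stripped are only for this sketch, and the comments inside the pattern have
been reworded; the regex itself is the one quoted above.

```python
import re

# Sample text from the original question.
text = """ c
C - Second full line comment (first comment had no text)
c   Third full line comment
    F44:N 2    $ Inline comments start with dollar sign and go to end of line"""

comment_pattern = re.compile(r"""
    (^[ \t]*c.*\n|     # full-line comment: first non-blank character is c or C
    [ \t]*\$.*)        # inline comment: optional blanks, then $ to end of line
    """, re.VERBOSE | re.MULTILINE | re.IGNORECASE)

# Replace every comment with nothing.
stripped = comment_pattern.sub("", text)
print(repr(stripped))
```

Note that the first alternative requires a trailing new-line, so a full-line
comment on the very last line of the text, with no new-line after it, would
not be removed by this pattern.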

