[Tutor] Regular expression - I

Tue Feb 18 23:55:46 CET 2014

On 02/18/2014 08:39 PM, Zachary Ware wrote:
> Hi Santosh,
>
> On Tue, Feb 18, 2014 at 9:52 AM, Santosh Kumar <rhce.san at gmail.com> wrote:
>>
>> Hi All,
>>
>> If you notice the below example, case I is working as expected.
>>
>> Case I:
>> In [41]: string = "<H*>test<H*>"
>>
>> In [42]: re.match('<H\*>',string).group()
>> Out[42]: '<H*>'
>>
>> But why is the raw string 'r' not working as expected ?
>>
>> Case II:
>>
>> In [43]: re.match(r'<H*>',string).group()
>> ---------------------------------------------------------------------------
>> AttributeError                            Traceback (most recent call last)
>> <ipython-input-43-d66b47f01f1c> in <module>()
>> ----> 1 re.match(r'<H*>',string).group()
>>
>> AttributeError: 'NoneType' object has no attribute 'group'
>>
>> In [44]: re.match(r'<H*>',string)
>
> It is working as expected, but you're not expecting the right thing
> ;).  Raw strings don't escape anything; they just prevent backslash
> escapes from expanding.  Case I works because "\*" is not an escape
> sequence that Python recognizes (unlike "\n" or "\t"), so it leaves
> the backslash in place:
>
>     >>> '<H\*>'
>     '<H\*>'
>
> The equivalent raw string is exactly the same in this case:
>
>     >>> r'<H\*>'
>     '<H\*>'
>
> The raw string you provided doesn't have the backslash, and Python
> will not add backslashes for you:
>
>     >>> r'<H*>'
>     '<H*>'
>
> The purpose of raw strings is to prevent Python from recognizing
> backslash escapes.  For example:
>
>     >>> path = 'C:\temp\new\dir' # Windows paths are notorious...
>     >>> path   # it looks mostly ok... [1]
>     'C:\temp\new\\dir'
>     >>> print(path)  # until you try to use it
>     C:      emp
>     ew\dir
>     >>> path = r'C:\temp\new\dir'  # now try a raw string
>     >>> path   # Now it looks like it's stuffed full of backslashes [2]
>     'C:\\temp\\new\\dir'
>     >>> print(path)  # but it works properly!
>     C:\temp\new\dir
>
> [1] Count the backslashes in the repr of 'path'.  Notice that there is
> only one before the 't' and the 'n', but two before the 'd'.  "\d" is
> not a special character, so Python didn't do anything to it.  There
> are two backslashes in the repr of "\d", because that's the only way
> to distinguish a real backslash; the "\t" and "\n" are actually the
> TAB and LINE FEED characters, as seen when printing 'path'.
>
> [2] Because they are all real backslashes now, so they have to be
> shown escaped ("\\") in the repr.
>
> In your regex, since you're looking for, literally, "<H*>", you'll
> need to backslash escape the "*" since it is a special character *in
> regular expressions*.  The regex engine has to see "\*", so the
> backslash itself must survive Python's own string processing.  To
> avoid having to keep track of what's special to Python as well as to
> regular expressions, the easiest way to do that is a raw string:
>
>     >>> re.match(r'<H\*>', string).group()
>     '<H*>'
>
> I hope this makes some amount of sense; I've had to write it up
> piecemeal and will never get it posted at all if I don't go ahead and
> post :).  If you still have questions, I'm happy to try again.  You
> may also want to have a look at the Regex HowTo in the Python docs:
> http://docs.python.org/3/howto/regex.html
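
The difference between an ordinary and a raw literal described above can be checked directly. This is only a small sketch; the variable names are made up for illustration, and the last backslash of the ordinary literal is doubled so that it is unambiguously a real backslash:

```python
# Ordinary literal: Python translates \t and \n before the string exists.
# (The last backslash is doubled here so it stays a real backslash.)
plain = 'C:\temp\new\\dir'

# Raw literal: no translation, every backslash is kept as typed.
raw = r'C:\temp\new\dir'

# Inspect what actually ended up in each string.
has_tab = '\t' in plain                # the "\t" became a TAB character
has_newline = '\n' in plain            # the "\n" became a LINE FEED
plain_backslashes = plain.count('\\')  # only the one before "dir" is left
raw_backslashes = raw.count('\\')      # all three are still there

print(repr(plain))
print(repr(raw))
print(raw)
```

Comparing the two repr lines with the printed raw string shows that the doubled backslashes exist only in the repr, not in the string itself.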

In addition to all this:
* You may be confusing raw strings with "regex escaping" (re.escape, a utility
function that escapes special regex characters for you).
* For simplicity, always use raw strings for regex patterns (as in your second
example); this does not spare you from escaping special regex characters, but
you only have to do it once!
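
A short sketch of the two points above, assuming the standard library's re.escape is the escaping helper meant here. The names text, manual and auto are made up for illustration:

```python
import re

text = "<H*>test<H*>"

# Without a backslash, '*' is a quantifier ("zero or more 'H'"), so the
# pattern matches things like '<>' or '<HH>' but not a literal '<H*>'.
unescaped = re.match(r'<H*>', text)
other = re.match(r'<H*>', "<HH>")

# Escaping by hand, once, in a raw string: the regex engine sees \*
# and matches a literal '*'.
manual = r'<H\*>'
manual_hit = re.match(manual, text)

# re.escape() builds an escaped pattern from the literal text for you.
auto = re.escape("<H*>")
auto_hits = re.findall(auto, text)

print(unescaped, other.group(), manual_hit.group(), auto_hits)
```

Note that re.escape works on the text you want to match literally, while the r prefix only affects how Python reads the pattern you typed; the two are independent and can be combined.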

d