[Tutor] Regular expression - I
spir
denis.spir at gmail.com
Tue Feb 18 23:55:46 CET 2014
On 02/18/2014 08:39 PM, Zachary Ware wrote:
> Hi Santosh,
>
> On Tue, Feb 18, 2014 at 9:52 AM, Santosh Kumar <rhce.san at gmail.com> wrote:
>>
>> Hi All,
>>
>> If you notice the below example, case I is working as expected.
>>
>> Case I:
>> In [41]: string = "<H*>test<H*>"
>>
>> In [42]: re.match('<H\*>',string).group()
>> Out[42]: '<H*>'
>>
>> But why is the raw string 'r' not working as expected ?
>>
>> Case II:
>>
>> In [43]: re.match(r'<H*>',string).group()
>> ---------------------------------------------------------------------------
>> AttributeError Traceback (most recent call last)
>> <ipython-input-43-d66b47f01f1c> in <module>()
>> ----> 1 re.match(r'<H*>',string).group()
>>
>> AttributeError: 'NoneType' object has no attribute 'group'
>>
>> In [44]: re.match(r'<H*>',string)
>
> It is working as expected, but you're not expecting the right thing
> ;). Raw strings don't escape anything; they just prevent backslash
> escapes from expanding. Case I works because "\*" is not an escape
> sequence that Python recognizes (unlike "\n" or "\t"), so it leaves
> the backslash in place:
>
> >>> '<H\*>'
> '<H\*>'
>
> The equivalent raw string is exactly the same in this case:
>
> >>> r'<H\*>'
> '<H\*>'
>
> The raw string you provided doesn't have the backslash, and Python
> will not add backslashes for you:
>
> >>> r'<H*>'
> '<H*>'
>
> The purpose of raw strings is to prevent Python from recognizing
> backslash escapes. For example:
>
> >>> path = 'C:\temp\new\dir' # Windows paths are notorious...
> >>> path # it looks mostly ok... [1]
> 'C:\temp\new\\dir'
> >>> print(path) # until you try to use it
> C: emp
> ew\dir
> >>> path = r'C:\temp\new\dir' # now try a raw string
> >>> path # Now it looks like it's stuffed full of backslashes [2]
> 'C:\\temp\\new\\dir'
> >>> print(path) # but it works properly!
> C:\temp\new\dir
>
> [1] Count the backslashes in the repr of 'path'. Notice that there is
> only one before the 't' and the 'n', but two before the 'd'. "\d" is
> not a special character, so Python didn't do anything to it. There
> are two backslashes in the repr of "\d", because that's the only way
> to distinguish a real backslash; the "\t" and "\n" are actually the
> TAB and LINE FEED characters, as seen when printing 'path'.
>
> [2] Because they are all real backslashes now, so they have to be
> shown escaped ("\\") in the repr.
>
> In your regex, since you're looking for, literally, "<H*>", you'll
> need to backslash escape the "*" since it is a special character *in
> regular expressions*. To avoid having to keep track of what's special
> to Python as well as regular expressions, you'll need to make sure the
> backslash itself is escaped, to make sure the regex sees "\*", and the
> easiest way to do that is a raw string:
>
> >>> re.match(r'<H\*>', string).group()
> '<H*>'
>
> I hope this makes some amount of sense; I've had to write it up
> piecemeal and will never get it posted at all if I don't go ahead and
> post :). If you still have questions, I'm happy to try again. You
> may also want to have a look at the Regex HowTo in the Python docs:
> http://docs.python.org/3/howto/regex.html
In addition to all this:
* You may be confusing raw strings with "regex escaping" (a utility function,
re.escape, that escapes special regex characters for you).
* For simplicity, always use raw strings for regex patterns (as in your second
example); this does not spare you from escaping special regex characters, but
you only have to do it once!
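
To make the difference concrete, here is a small sketch (the variable names are
only for illustration, and the test string is the one from your example). It
shows that a raw string changes only how Python reads the literal, while
re.escape from the standard library is what actually escapes regex
metacharacters:

```python
import re

string = "<H*>test<H*>"

# A raw string and a doubled backslash spell the same three-part pattern:
# '<H', a backslash followed by '*', and '>'.
raw_pattern = r'<H\*>'
plain_pattern = '<H\\*>'
print(raw_pattern == plain_pattern)

# re.escape() adds the regex escaping for you, so the text is matched literally.
literal_match = re.match(re.escape('<H*>'), string)
print(literal_match.group())

# Without any escaping, '*' is a quantifier: '<', zero or more 'H', then '>'.
quantifier_match = re.match(r'<H*>', string)
print(quantifier_match)
print(re.match(r'<H*>', '<HHH>').group())
```

The unescaped pattern fails on your string because, after '<H', the regex wants
either another 'H' or the closing '>', and finds '*' instead.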
d