How to escape # hash character in regex match strings

Lie Ryan lie.1296 at gmail.com
Sun Jun 14 09:48:38 CEST 2009


Brian D wrote:
> On Jun 11, 9:22 am, Brian D <brianden... at gmail.com> wrote:
>> On Jun 11, 2:01 am, Lie Ryan <lie.1... at gmail.com> wrote:
>>
>>> 504cr... at gmail.com wrote:
>>>> I've encountered a problem with my RegEx learning curve -- how to
>>>> escape hash characters # in strings being matched, e.g.:
>>>>>>> string = re.escape('123#abc456')
>>>>>>> match = re.match('\d+', string)
>>>>>>> print match
>>>> <_sre.SRE_Match object at 0x00A6A800>
>>>>>>> print match.group()
>>>> 123
>>>> The correct result should be:
>>>> 123456
>>>> I've tried to escape the hash symbol in the match string without
>>>> result.
>>>> Any ideas? Is the answer something I overlooked in my lurching Python
>>>> schooling?
>>> As you're not being clear on what you wanted, I'm just guessing this is
>>> what you wanted:
>>>>>> s = '123#abc456'
>>>>>> re.match('\d+', re.sub('#\D+', '', s)).group()
>>> '123456'
>>>>>> s = '123#this is a comment and is ignored456'
>>>>>> re.match('\d+', re.sub('#\D+', '', s)).group()
>>> '123456'
>> Sorry I wasn't more clear. I positively appreciate your reply. It
>> provides half of what I'm hoping to learn. The hash character is
>> actually a desirable hook to identify a data entity in a scraping
>> routine I'm developing, but not a character I want in the scrubbed
>> data.
>>
>> In my application, the hash makes a string of alphanumeric characters
>> unique from other alphanumeric strings. The strings I'm looking for
>> are actually manually-entered identifiers, but a real machine-created
>> identifier shouldn't contain that hash character. The correct pattern
>> should be 'A1234509', but is instead often merely entered as '#12345'
>> when the first character, representing an alphabet sequence for the
>> month, and the last two characters, representing a two-digit year, can
>> be assumed. Identifying the hash character in a RegEx match is a way
>> of trapping the string and transforming it into its correct machine-
>> generated form.
>>
>> I'm surprised it's been so difficult to find an example of the hash
>> character in a RegEx string -- for exactly this type of situation,
>> since it's so common in the real world that people want to put a pound
>> symbol in front of a number.
>>
>> Thanks!
> 
> By the way, other forms the strings can take in their manually created
> forms:
> 
> A#12345
> #1234509
> 
> Garbage in, garbage out -- I know. I wish I could tell the people
> entering the data how challenging it is to work with what they
> provide, but it is, after all, a screen-scraping routine.

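Perhaps something like this? The '#' needs no escaping in an ordinary
pattern, and the optional groups pick up the month letter and the
two-digit year when they are present: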

>>> # you can use re.search if that suits better
>>> a = re.match(r'([A-Z]?)#(\d{5})(\d\d)?', 'A#12345')
>>> b = re.match(r'([A-Z]?)#(\d{5})(\d\d)?', '#1234509')
>>> a.group(0)
'A#12345'
>>> a.group(1)
'A'
>>> a.group(2)
'12345'
>>> a.group(3)   # None (no year given), so the prompt prints nothing
>>> b.group(0)
'#1234509'
>>> b.group(1)
''
>>> b.group(2)
'12345'
>>> b.group(3)
'09'
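
To get from those groups to the machine-generated form you described
('A1234509'), something along these lines might do. It is only a sketch
in Python 3 syntax: normalize() is a name I made up, and the fallback
month letter and year are placeholders, since only you know which values
"can be assumed" for a given record.

```python
import re

# Same pattern as in the session above: optional month letter, a literal
# '#', five digits, and an optional two-digit year.
ID_PATTERN = re.compile(r'([A-Z]?)#(\d{5})(\d\d)?')

def normalize(raw, default_month='A', default_year='09'):
    """Return the machine form (e.g. 'A1234509'), or None if raw has no '#' id.

    default_month and default_year are hypothetical fallbacks used when the
    person entering the data left those parts out.
    """
    m = ID_PATTERN.match(raw.strip())
    if m is None:
        return None
    month, serial, year = m.groups()
    # month is '' and year is None when omitted; both are falsy.
    return (month or default_month) + serial + (year or default_year)

for raw in ['#12345', 'A#12345', '#1234509']:
    print(raw, '->', normalize(raw))
```

With the placeholder defaults all three inputs come out as 'A1234509'.
Swap re.match for re.search if the identifier can appear in the middle
of the scraped text.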

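On the question in the subject line: as far as I know, '#' only needs a
backslash when the pattern is compiled with re.VERBOSE, where an
unescaped '#' starts a comment. Note also that re.escape() is meant for
text that goes into the pattern; applying it to the string being
searched, as in the first post, just adds a backslash to the data. A
small sketch (Python 3 syntax, variable names are mine):

```python
import re

s = '123#abc456'

# In an ordinary pattern '#' is just a literal character.
plain = re.search(r'\d+#', s)

# With re.VERBOSE an unescaped '#' begins a comment, so a literal hash
# has to be written as \# (or placed in a character class, [#]).
verbose = re.search(r'''
    \d+     # leading digits
    \#      # a literal hash
    ''', s, re.VERBOSE)

# re.escape() is applied to the literal text used as the pattern,
# never to the string that is being searched.
hit = re.search(re.escape('#abc'), s)

print(plain.group(), verbose.group(), hit.group())
```

All three searches find the hash; the first two match '123#' and the
last one matches '#abc'.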

