Regular Expression Help

Graham Breed x31eq at cnntp.org
Mon Apr 13 02:22:55 EDT 2009


Jean-Claude Neveu wrote:
> Hello,
> 
> I was wondering if someone could tell me where I'm going wrong with my 
> regular expression. I'm trying to write a regexp that identifies whether 
> a string contains a correctly-formatted currency amount. I want to 
> support dollars, UK pounds and Euros, but the example below deliberately 
> omits Euros in case the Euro symbol gets mangled anywhere in email or 
> listserver processing. I also want people to be able to omit the 
> currency symbol if they wish.

If Euro symbols can get mangled, so can Pound signs. 
They're both outside ASCII.

> My regexp that I'm matching against is: "^\$\£?\d{0,10}(\.\d{2})?$"
> 
> Here's how I think it should work (but clearly I'm wrong, because it 
> does not actually work):
> 
> ^\$\£?      Require zero or one instance of $ or £ at the start of the 
> string.

^[$£]? is correct.  Your original \$\£? requires a $ and 
then allows an optional £ after it.  And, as you're using 
re.match, the ^ is superfluous.  (A previous message 
suggested ^[\$£]? which will also work.  You generally need 
to escape a Dollar sign, but not inside a character class.)
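
To make the difference concrete, here is a small sketch 
comparing the two patterns.  It assumes Python 3, where 
every str is unicode; the names original, fixed and samples 
are only for illustration.

```python
import re

# The pattern from the original message: a mandatory "$", then an optional "£".
original = re.compile(r"^\$\£?\d{0,10}(\.\d{2})?$")

# The character-class version: at most one of "$" or "£".
# No "^" is needed because re.match already anchors at the start.
fixed = re.compile(r"[$£]?\d{0,10}(\.\d{2})?$")

samples = ["$12.42", "£12.42", "12.42", "$£12"]
results = {s: (bool(original.match(s)), bool(fixed.match(s))) for s in samples}

for s, (old, new) in results.items():
    print(f"{s!r:10} original={old!s:5} fixed={new}")
```

The original pattern rejects "£12.42" and "12.42" because 
the $ is mandatory, and it accepts "$£12", which is 
presumably not what was intended.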

You should also think about the encoding.  In my terminal, 
"£" is identical to '\xc2\xa3'.  That is, two bytes encoding 
a single code point in UTF-8.  If you assume this encoding, 
it's best to make it explicit.  And if you don't assume a 
specific encoding, it's best to convert to unicode to do the 
comparisons, so for 2.x (or portability) your string should 
start with u" rather than a bare quote.
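
As a rough illustration of why the encoding matters, the 
sketch below matches the same amount once as raw UTF-8 bytes 
and once as decoded text.  It assumes Python 3 and UTF-8 
input; the variable names are made up for the example.

```python
import re

pound_bytes = "£".encode("utf-8")   # the two bytes seen in a UTF-8 terminal
assert pound_bytes == b"\xc2\xa3"

raw = b"\xc2\xa312.42"              # "£12.42" as UTF-8 bytes

# In a bytes pattern the class [$\xc2\xa3] matches ONE byte, so it can
# consume only \xc2; the \xa3 that follows is not a digit and the match fails.
byte_match = re.match(rb"[$\xc2\xa3]?\d{0,10}(\.\d{2})?$", raw)

# Decoding explicitly turns "£" into a single character (U+00A3).
text = raw.decode("utf-8")
text_match = re.match(r"[$\u00a3]?\d{0,10}(\.\d{2})?$", text)

print(byte_match, text_match)
```

With the bytes pattern the Pound amount is rejected, while 
the decoded text matches as expected.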

> d{0,10}     Next, require between zero and ten alpha characters.

There's a backslash missing here, but not from your original 
expression.  Digits are not "alpha characters".

> (\.\d{2})?  Optionally, two characters can follow. They must be preceded 
> by a decimal point.

That works.  Of course, \d{2} is longer than the simpler \d\d.

Note that you can comment the original expression like this:

rex = u"""(?x)
     ^[$£]?     # Zero or one instance of $ or £
                # at the start of the string.
     \d{0,10}   # Between zero and ten digits.
     (\.\d{2})? # Optionally, two digits.
                # They must be preceded by a decimal point.
     $          # End of string (or just before a final newline).
"""

Then anybody (including you) who comes to read this in the 
future will have some idea what you were trying to do.
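
A commented pattern like this is also easy to check against 
a few inputs.  The following is only a sketch, assuming 
Python 3 (so a raw string replaces the u prefix); is_amount 
and checks are hypothetical names, not part of the original 
code.

```python
import re

rex = r"""(?x)
    [$£]?       # Zero or one instance of $ or £.
    \d{0,10}    # Between zero and ten digits.
    (\.\d{2})?  # Optionally, a decimal point and two digits.
    $           # End of string (or just before a final newline).
"""

def is_amount(s):
    """Return True if s matches the commented pattern."""
    return re.match(rex, s) is not None

inputs = ["$12.42", "$12", "£12.42", "12", "$12.4", "$12,482.96", "", "$"]
checks = {s: is_amount(s) for s in inputs}

for s, ok in checks.items():
    print(repr(s), ok)
```

Because every part of the pattern is optional, the empty 
string and a lone "$" are accepted too.  Requiring at least 
one digit, for example \d{1,10}, would be one way to avoid 
that.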

> Examples of acceptable input should be:
> 
> $12.42
> $12
> £12.42
> $12,482.96  (now I think about it, I have not catered for this in my 
> regexp)

Yes, you need to think about that.
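
One possible way to allow thousands separators is sketched 
below.  It is an assumption about what you want rather than 
a definitive answer: it requires at least one digit, drops 
the ten-digit limit, insists on groups of exactly three 
digits after each comma, and uses re.fullmatch (Python 3.4 
or later).  The names amount and is_grouped_amount are 
invented for the example.

```python
import re

amount = re.compile(r"""(?x)
    [$£]?                    # Optional currency symbol.
    (?:
        \d{1,3}(?:,\d{3})*   # Grouped: 1-3 digits, then groups of ",ddd".
      |
        \d+                  # Or plain digits with no separators.
    )
    (?:\.\d{2})?             # Optional decimal point and two digits.
""")

def is_grouped_amount(s):
    """Return True if the whole string is a currency amount."""
    return amount.fullmatch(s) is not None

tests = ["$12,482.96", "$12482.96", "£1,000,000", "$1,24.00", "$12,4829", "$"]
outcome = {s: is_grouped_amount(s) for s in tests}

for s, ok in outcome.items():
    print(repr(s), ok)
```

Misplaced commas such as "$1,24.00" are rejected, while both 
"$12,482.96" and "$12482.96" are accepted.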


                Graham



