Dive Into Python: call for comments (long)

Mon Apr 23 18:50:23 EDT 2001

in article mailman.988058444.8527.python-list at python.org, Sean 'Shaleh'
Perry at shaleh at valinux.com wrote on 4/23/01 4:40 PM:
> what about simple cases like 'I' or 'V'?  'IV'?

This is a good idea.  I could add all the cases where the integer translates
to a single-character Roman numeral (1, 5, 10, 50, etc.), plus the numbers
from 1 to 10.  The set of known values was just a random sampling throughout
the domain; perhaps a little less randomness is in order.
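
As a rough sketch of what that less random set might look like (the
variable names here are made up for illustration and are not taken from the
existing test suite; only the integer/numeral pairs are standard):

```python
# Hypothetical names; only the (integer, numeral) pairs are standard.
single_character_values = (
    (1, 'I'), (5, 'V'), (10, 'X'), (50, 'L'),
    (100, 'C'), (500, 'D'), (1000, 'M'),
)

one_to_ten = (
    (1, 'I'), (2, 'II'), (3, 'III'), (4, 'IV'), (5, 'V'),
    (6, 'VI'), (7, 'VII'), (8, 'VIII'), (9, 'IX'), (10, 'X'),
)

# Merge the two sets, drop the overlap (1, 5 and 10), and sort by integer.
extra_known_values = tuple(sorted(set(single_character_values + one_to_ten)))

for integer, numeral in extra_known_values:
    print(integer, numeral)
```

These pairs could be appended to whatever known-values table the tests
already use.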

> Also, I find
> 
> #Define pattern to detect valid Roman numerals
> romanNumeralPattern = \
> re.compile('^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$')
> 
> confusing.

"Some people, when confronted with a problem, think 'I know, I'll use
regular expressions.'  Now they have two problems." (Jamie Zawinski)

>  As I read this, I see 'begins with optionally up to 3 M's, then
> either CM, CD, or some permutation of D and C'.  Yet this appears to match
> something as simply as V.  Why?

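It matches 'V' because every piece of the pattern is optional: 'M?M?M?'
matches '', then 'D?C?C?C?' matches '', then 'L?X?X?X?' matches '', and
finally 'V?I?I?I?' matches 'V'.  (Note that 'D?C?C?C?' is not a permutation
of D and C; it is an optional D followed by up to three C's, in that order.)
The pattern is meant to match all and only the valid Roman numerals for the
numbers 1..3999.  (It does, in fact, match all of them; I believe it matches
only them.)  It exists so we can validate the input to fromRoman() up front.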
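
To make the matching visible, here is a small sketch that prints what each
parenthesized group captures.  The describe() helper is hypothetical and
exists only for this illustration; the pattern is the one quoted above.
The last call shows an edge case worth keeping in mind: because every piece
is optional, the empty string matches too, so a caller would probably want
to reject it separately.

```python
import re

# Pattern copied from the quoted code above.
romanNumeralPattern = re.compile(
    '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$')

def describe(numeral):
    """Return the (hundreds, tens, ones) groups, or None if no match.

    Hypothetical helper for illustration only.  The leading M's are not
    inside parentheses, so they do not show up as a group.
    """
    match = romanNumeralPattern.search(numeral)
    if match is None:
        return None
    return match.groups()

print(describe('V'))      # ('', '', 'V')
print(describe('MCMXC'))  # ('CM', 'XC', '')
print(describe('IIII'))   # None
print(describe(''))       # ('', '', '')
```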

> 
> Also, perhaps use of the {} operator in the regex's might help.  Maybe not.
> 

You're absolutely right.  The regular expression can be rewritten as

romanNumeralPattern = \
    re.compile('^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

Is this more efficient?  Is it more elegant (as regular expressions go)?
Any regular expression experts care to comment?
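
Whatever the answer, the two forms should at least accept exactly the same
strings.  Below is a minimal sketch of one way to check that.  The
to_roman() helper and ROMAN_MAP table are written just for this check and
are not the toRoman() from the book.  The check says nothing about speed;
the standard timeit module could be used to measure that.

```python
import itertools
import re

old_pattern = re.compile(
    '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$')
new_pattern = re.compile(
    '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

# Hypothetical helper table and function, used only to generate test input.
ROMAN_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
             ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
             ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

def to_roman(n):
    """Greedy integer-to-numeral conversion for 1..3999."""
    result = ''
    for numeral, value in ROMAN_MAP:
        while n >= value:
            result += numeral
            n -= value
    return result

# Every valid numeral for 1..3999 should match both patterns.
valid = [to_roman(n) for n in range(1, 4000)]
all_valid_match = all(old_pattern.search(s) and new_pattern.search(s)
                      for s in valid)

# Every string of up to four Roman letters (valid or not) should get the
# same verdict from both patterns.
short_strings = (''.join(chars)
                 for length in range(0, 5)
                 for chars in itertools.product('IVXLCDM', repeat=length))
disagreements = [s for s in short_strings
                 if bool(old_pattern.search(s)) != bool(new_pattern.search(s))]

print(all_valid_match)  # True
print(disagreements)    # []
```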

-M