How to write simple code to match strings?

beginner zyzhu2000 at gmail.com
Wed Dec 30 02:07:15 EST 2009


Hi Steve,

On Dec 30, 12:01 am, Steven D'Aprano <st... at REMOVE-THIS-
cybersource.com.au> wrote:
> On Tue, 29 Dec 2009 21:01:05 -0800, beginner wrote:
> > Hi All,
>
> > I ran into a problem.  I have a string s that can be any one of a
> > number of possible things. I use regular-expression code like the one
> > below to match and parse it, but it looks very ugly. Also, the strings
> > are literally matched twice -- once for matching and once for
> > extraction -- which seems very slow. Is there a better way to handle this?
>
> The most important thing you should do is to put the regular expressions
> into named variables, rather than typing them out twice. The names
> should, preferably, describe what they represent.
>
> Oh, and you should use raw strings for regexes. In this particular
> example, I don't think it makes a difference, but if you ever modify the
> strings, it will!
>
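
To see the raw-string point for myself I tried the small sketch below. The pattern and the sample text are made up for illustration and are not from the original code. In a plain string Python turns '\b' into a backspace character before the regex engine ever sees it, while in a raw string it arrives as a word boundary.

```python
import re

plain = '\bcat\b'    # Python converts each '\b' to a backspace (0x08)
raw = r'\bcat\b'     # backslashes kept; the regex engine sees word boundaries

print(len(plain), len(raw))             # the two strings differ in length
print(re.search(plain, 'a cat sat'))    # no backspaces in the text, so no match
print(re.search(raw, 'a cat sat'))      # finds 'cat'
```

Escapes such as '\$' or '\d' happen to survive in a plain string because Python leaves unrecognised escapes untouched, which is presumably why it makes no difference in this particular example.
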
> You should get rid of the unnecessary double calls to match. That's just
> wasteful. Also, since re.match tests the start of the string, you don't
> need the leading ^ regex (but you do need the $ to match the end of the
> string).
>
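
A quick check of the ^ and $ point, using a simplified stand-in pattern rather than the ones from the code (the names DIGITS and DIGITS_END are mine):

```python
import re

DIGITS = r'[0-9]+'         # no anchors at all
DIGITS_END = r'[0-9]+$'    # anchored at the end only

# re.match only tries position 0, so a leading '^' would add nothing.
print(re.match(DIGITS, 'abc123'))        # None: string does not start with a digit
print(re.match(DIGITS, '123abc'))        # matches '123' and ignores the rest
print(re.match(DIGITS_END, '123abc'))    # None: '$' rejects the trailing text
print(re.match(DIGITS_END, '123'))       # matches the whole string
```

One caveat as far as I can tell: '$' also matches just before a trailing newline, so '123\n' would still be accepted by DIGITS_END.
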
> You should also fix the syntax error, where you have "elif s=='-'"
> instead of "elif s='-'".
>
> You should consider putting the cheapest test(s) first, or even moving
> the expensive tests into a separate function.
>
> And don't be so stingy with spaces in your source code, it helps
> readability by reducing the density of characters.
>
> So, here's my version:
>
> import re
> import dateutil.parser  # third-party package: python-dateutil
>
> def _re_match_items(s):
>     # Set up some regular expressions.
>     COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
>     FLOAT_RE = COMMON_RE + '$'
>     BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'
>     DATE_RE = r'\d{1,2}-\w+-\d{1,2}$'
>     mo = re.match(FLOAT_RE, s)  # "mo" short for "match object"
>     if mo:
>         return float(mo.group(1).replace(',', ''))
>     # Otherwise mo will be None and we go on to the next test.
>     mo = re.match(BRACKETED_FLOAT_RE, s)
>     if mo:
>         return -float(mo.group(1).replace(',', ''))
>     if re.match(DATE_RE, s):
>         return dateutil.parser.parse(s, dayfirst=True)
>     raise ValueError("bad string can't be matched")
>
> def convert_data_item(s):
>     if s == '-':
>         return None
>     else:
>         try:
>             return _re_match_items(s)
>         except ValueError:
>             print("Unrecognized format %s" % s)
>             return s
>
> Hope this helps.
>
> --
> Steven

This definitely helps.
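
To check my understanding I tried the two number patterns on a few sample strings. This is only a sketch: parse_number is a hypothetical helper of mine that copies the float part of the code above and leaves out the date branch, and the sample strings are made up.

```python
import re

COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
FLOAT_RE = COMMON_RE + '$'
BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'

def parse_number(s):
    """Hypothetical helper: the float branches of the quoted code, no dates."""
    mo = re.match(FLOAT_RE, s)
    if mo:
        return float(mo.group(1).replace(',', ''))
    mo = re.match(BRACKETED_FLOAT_RE, s)
    if mo:
        # Brackets mean a negative amount, as in the quoted code.
        return -float(mo.group(1).replace(',', ''))
    raise ValueError("bad string can't be matched: %r" % s)

for sample in ['$1,234.50', '-42', '(1,234.50)', '.5']:
    print(sample, '->', parse_number(sample))
```

Each string is now matched once per pattern, and the captured group is reused for the conversion instead of matching a second time.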

One question: should it be s=='-' or s='-'? I thought == tests for
equality and = is assignment, so the comparison would need ==.
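
A small sketch I used to check this, assuming nothing beyond the built-in compile(). The helper name compiles is mine:

```python
def compiles(source):
    """Return True if source is syntactically valid Python."""
    try:
        compile(source, '<example>', 'exec')
        return True
    except SyntaxError:
        return False

# '==' compares two values; '=' is assignment and is not allowed
# as the condition of an 'if' statement.
print(compiles("if s == '-':\n    pass\n"))
print(compiles("if s = '-':\n    pass\n"))
```

The first line compiles and the second is rejected with a SyntaxError, so the comparison in convert_data_item needs ==.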

Thanks again,
G





