How to write simple code to match strings?
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Wed Dec 30 01:01:57 EST 2009
On Tue, 29 Dec 2009 21:01:05 -0800, beginner wrote:
> Hi All,
>
> I run into a problem. I have a string s that can be a number of
> possible things. I use a regular expression code like below to match and
> parse it. But it looks very ugly. Also, the strings are literally
> matched twice -- once for matching and once for extraction -- which
> seems to be very slow. Is there any better way to handle this?
The most important thing you should do is to put the regular expressions
into named variables, rather than typing them out twice. The names
should, preferably, describe what they represent.
Oh, and you should use raw strings for regexes. In this particular
example, I don't think it makes a difference, but if you ever modify the
strings, it will!
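To illustrate why raw strings matter, here is a minimal sketch. The sample text and variable names are my own, not taken from the original code. It uses `\b`, which means "word boundary" to the regex engine but "backspace" to Python's ordinary string literals.

```python
import re

text = 'a cat sat'  # hypothetical sample input

# Raw string: the regex engine receives backslash + 'b', i.e. a word boundary.
raw_match = re.search(r'\bcat\b', text)

# Ordinary string: Python converts '\b' to a backspace character first,
# so the pattern looks for backspace + 'cat' + backspace and finds nothing.
plain_match = re.search('\bcat\b', text)

print(raw_match is not None)    # True
print(plain_match is not None)  # False
```

Escapes such as `\d` and `\.` happen to survive in ordinary strings because Python does not recognise them as string escapes, which is why the original patterns worked anyway.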
You should get rid of the unnecessary double calls to match. That's just
wasteful. Also, since re.match tests the start of the string, you don't
need the leading ^ regex (but you do need the $ to match the end of the
string).
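A small sketch of that anchoring behaviour, using made-up inputs of my own rather than the strings from the original post:

```python
import re

# re.match only tries to match at the beginning of the string,
# so a leading ^ adds nothing.
at_start = re.match(r'\d+', '123abc')      # matches '123'
not_at_start = re.match(r'\d+', 'abc123')  # None: digits are not at the start

# Without $, trailing junk is silently accepted; with $, it is rejected.
with_end_anchor = re.match(r'\d+$', '123abc')  # None
whole_string = re.match(r'\d+$', '123')        # matches '123'

print(at_start.group(0))
print(not_at_start, with_end_anchor)
print(whole_string.group(0))
```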
You should also fix the syntax error, where you have "elif s='-'"
instead of "elif s=='-'".
You should consider putting the cheapest test(s) first, or even moving
the expensive tests into a separate function.
And don't be so stingy with spaces in your source code; extra whitespace
helps readability by reducing the density of characters.
So, here's my version:
import re
import dateutil.parser

def _re_match_items(s):
    # Set up some regular expressions.
    COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
    FLOAT_RE = COMMON_RE + '$'
    BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'
    DATE_RE = r'\d{1,2}-\w+-\d{1,2}$'
    mo = re.match(FLOAT_RE, s)  # "mo" short for "match object"
    if mo:
        return float(mo.group(1).replace(',', ''))
    # Otherwise mo will be None and we go on to the next test.
    mo = re.match(BRACKETED_FLOAT_RE, s)
    if mo:
        # A bracketed number is treated as negative.
        return -float(mo.group(1).replace(',', ''))
    if re.match(DATE_RE, s):
        return dateutil.parser.parse(s, dayfirst=True)
    raise ValueError("bad string can't be matched")

def convert_data_item(s):
    # Cheapest test first: a lone '-' means "no value".
    if s == '-':
        return None
    else:
        try:
            return _re_match_items(s)
        except ValueError:
            print("Unrecognized format %s" % s)
            return s
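As a quick sanity check of the two numeric patterns, here is a self-contained sketch that repeats them and tries a few sample inputs. The helper name `parse_number` and the sample strings are hypothetical, and the date branch is left out so that nothing beyond the standard library is needed.

```python
import re

# Same numeric patterns as in the version above.
COMMON_RE = r'\$?([-+]?[0-9,]*\.?[0-9,]+)'
FLOAT_RE = COMMON_RE + '$'
BRACKETED_FLOAT_RE = r'\(' + COMMON_RE + r'\)$'

def parse_number(s):
    """Return a float for plain or bracketed numbers, else None.

    Hypothetical helper for illustration only; it covers just the
    two numeric cases, not dates.
    """
    mo = re.match(FLOAT_RE, s)
    if mo:
        return float(mo.group(1).replace(',', ''))
    mo = re.match(BRACKETED_FLOAT_RE, s)
    if mo:
        return -float(mo.group(1).replace(',', ''))
    return None

for sample in ['$1,234.50', '(42)', '-3.5', 'abc']:
    print(repr(sample), '->', parse_number(sample))
```

Each string is matched at most once per pattern, and the captured group is reused for the conversion, so nothing is matched twice.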
Hope this helps.
--
Steven