regular expression extracting groups
clawsicus at gmail.com
Sun Aug 10 08:30:23 EDT 2008
Hi list,
I'm trying to use regular expressions to help me quickly extract the
contents of messages that my application will receive. I have worked
out most of the regex but the last section of the message has me
stumped. This is mostly because I want to pull the content out into
regex groups that I can easily access later. I have a regex to extract
the key/value pairs but it ends up with only the contents of the last
key/value pair encountered.
An example of the section of the message that is troubling me appears
like this:
{
option=value
foo=bar
another=42
option=7
}
So it's basically a bunch of lines. Every line is terminated with a
'\n' character. The number of key/value fields changes depending on
the particular message. Also notice that there are two 'option' keys.
This is allowable and I need to cater for it.
A couple of example messages are:
xpl-stat\n{\nhop=1\nsource=vendor-device.instance\ntarget=*\n}\nhbeat.basic\n{\ninterval=10\n}\n
xpl-stat\n{\nhop=1\nsource=vendor-device.instance\ntarget=vendor-device.instance\n}\nconfig.list\n{\nreconf=newconf\noption=interval\noption=group[16]\noption=filter[16]\n}\n
As all messages follow the same pattern, I'm hoping to develop a
single generic regex that can pull a message from a received packet,
instead of one regex for each message kind (there are many).
The regex I came up with looks like this:
import re

# This should match any xPL message
GROUP_MESSAGE_TYPE = 'message_type'
GROUP_HOP = 'hop'
GROUP_SOURCE = 'source'
GROUP_TARGET = 'target'
GROUP_SRC_VENDOR_ID = 'source_vendor_id'
GROUP_SRC_DEVICE_ID = 'source_device_id'
GROUP_SRC_INSTANCE_ID = 'source_instance_id'
GROUP_TGT_VENDOR_ID = 'target_vendor_id'
GROUP_TGT_DEVICE_ID = 'target_device_id'
GROUP_TGT_INSTANCE_ID = 'target_instance_id'
GROUP_IDENTIFIER_TYPE = 'identifier_type'
GROUP_SCHEMA = 'schema'
GROUP_SCHEMA_CLASS = 'schema_class'
GROUP_SCHEMA_TYPE = 'schema_type'
GROUP_OPTION_KEY = 'key'
GROUP_OPTION_VALUE = 'value'
XplMessageGroupsRe = r'''(?P<%s>xpl-(cmnd|stat|trig))\n    # message type
\{\n
hop=(?P<%s>[1-9]{1})\n    # hop count
source=(?P<%s>(?P<%s>[a-z0-9]{1,8})-(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,16}))\n    # source identifier
target=(?P<%s>(\*|(?P<%s>[a-z0-9]{1,8})-(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,16})))\n    # target identifier
\}\n
(?P<%s>(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,8}))\n    # schema
\{\n
(?:(?P<%s>[a-z0-9\-]{1,16})=(?P<%s>[\x20-\x7E]{0,128})\n){1,64}    # key/value pairs
\}\n''' % (GROUP_MESSAGE_TYPE,
           GROUP_HOP,
           GROUP_SOURCE,
           GROUP_SRC_VENDOR_ID,
           GROUP_SRC_DEVICE_ID,
           GROUP_SRC_INSTANCE_ID,
           GROUP_TARGET,
           GROUP_TGT_VENDOR_ID,
           GROUP_TGT_DEVICE_ID,
           GROUP_TGT_INSTANCE_ID,
           GROUP_SCHEMA,
           GROUP_SCHEMA_CLASS,
           GROUP_SCHEMA_TYPE,
           GROUP_OPTION_KEY,
           GROUP_OPTION_VALUE)
XplMessageGroups = re.compile(XplMessageGroupsRe, re.VERBOSE | re.DOTALL)
If I pass the second example message through this regex, the 'key'
group ends up containing 'option' and the 'value' group ends up
containing 'filter[16]', which is the last key/value pair in that
message.
So the problem lies in the key/value section of the regex. The
pattern matches multiple occurrences, but each one overwrites the
content of the single key/value group, hence I can't extract and
access all fields.
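The behaviour can be reproduced in isolation with a minimal sketch.
It assumes a cut-down pattern that covers only the key/value block;
the names KV_BLOCK and body are illustrative and not part of the
application code above.

```python
import re

# Hypothetical, cut-down pattern: only the key/value block of a message.
KV_BLOCK = re.compile(
    r'\{\n'
    r'(?:(?P<key>[a-z0-9\-]{1,16})=(?P<value>[\x20-\x7E]{0,128})\n){1,64}'
    r'\}\n')

# Body of the second example message.
body = ('{\nreconf=newconf\noption=interval\n'
        'option=group[16]\noption=filter[16]\n}\n')

m = KV_BLOCK.match(body)

# A group inside a repeated construct keeps only its final capture,
# so the three earlier pairs are not available from the match object.
last_key = m.group('key')
last_value = m.group('value')
print(last_key, last_value)   # option filter[16]
```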
Is there some other way to do this which allows me to store all the
key/value pairs into the regex match object for later retrieval?
Perhaps using the standard unnamed number groups?
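One possible direction, offered as a sketch rather than a definitive
answer: unnamed numbered groups inside a repetition keep only their
last capture in the same way, so a two-step approach may be needed.
The sketch below assumes the whole body can be captured as one group
and then split with a second pattern via re.findall. BODY_RE, PAIR_RE
and extract_pairs are hypothetical names, and the header part of the
message is left out for brevity.

```python
import re

# Step 1 (assumed simplification): capture the schema and the entire
# key/value body as a single group instead of one group per pair.
BODY_RE = re.compile(r'''
    (?P<schema>[a-z0-9]{1,8}\.[a-z0-9]{1,8})\n                     # schema
    \{\n
    (?P<body>(?:[a-z0-9\-]{1,16}=[\x20-\x7E]{0,128}\n){1,64})      # all pairs
    \}\n
''', re.VERBOSE)

# Step 2: a second pattern applied repeatedly to the captured body.
PAIR_RE = re.compile(r'([a-z0-9\-]{1,16})=([\x20-\x7E]{0,128})\n')


def extract_pairs(text):
    """Return (schema, [(key, value), ...]) or None if nothing matches."""
    m = BODY_RE.search(text)
    if m is None:
        return None
    # findall returns every pair in order, so duplicate keys survive.
    return m.group('schema'), PAIR_RE.findall(m.group('body'))


message = ('xpl-stat\n{\nhop=1\nsource=vendor-device.instance\n'
           'target=vendor-device.instance\n}\nconfig.list\n{\n'
           'reconf=newconf\noption=interval\noption=group[16]\n'
           'option=filter[16]\n}\n')

schema, pairs = extract_pairs(message)
print(schema)
print(pairs)
```

Keeping the pairs as a list of tuples rather than a dict preserves
both the order and the repeated 'option' keys.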
Thanks,
Chris