regex pattern to extract repeating groups
Malcolm
blakemalc66 at gmail.com
Sat Aug 25 19:55:32 EDT 2018
I am trying to understand why my regex is not extracting all of the
characters between two delimiters.
The complete string is the XMP IFD data extracted from a .CR2 image file.
I do have a workaround, but it's messy and possibly not future-proof.
Any insight greatly appreciated.
Malcolm
My test code is:

import re

# environment
# Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32

# extract of real data for test purposes. This extract is repeated
# the delimiters are <dc: and </dc:
extract = '''
 <dc:creator>
 <rdf:Seq>
 <rdf:li>abcdef zxcvb</rdf:li>
 </rdf:Seq>
 </dc:creator>
'''

# modify the test data
modified_extract_1 = '''
<dc:creator>
<rdf:Seq>
<rdf:li>abcdef zxcvb</rdf:li>
</rdf:Seq>
</dc:creator>
'''

# modify test data version 2 this works
modified_extract_2 = '''
<dc:creator>
<rdf:li>abcdef zxcvb</rdf:li>
</dc:creator>
'''

re_pattern = r'( *<dc:.*</dc:)'

print('extract', re.search(re_pattern, extract, re.DOTALL))
# >>> s1 <_sre.SRE_Match object; span=(1, 89), match=' <dc:creator>\n <rdf:Seq>\n <rdf:li>abcd>

print('modified_extract_1', re.search(re_pattern, modified_extract_1, re.DOTALL))
# >>> <_sre.SRE_Match object; span=(1, 70), match='<dc:creator>\n<rdf:Seq>\n<rdf:li>abcdef zxcvb</rd>

print('modified_extract_2', re.search(re_pattern, modified_extract_2, re.DOTALL))
# >>> s <_sre.SRE_Match object; span=(1, 49), match='<dc:creator>\n<rdf:li>abcdef zxcvb</rdf:li>\n</dc>

# NOTE the missing ':' from the </dc
I
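
A possible explanation, offered as a minimal sketch rather than a definitive diagnosis: the printed form of a match object abbreviates the matched text, so the match='...' value can end before the match itself does. The spans reported above (for example (1, 49)) are consistent with the full '</dc:' delimiter having been matched. The sketch below reuses re_pattern and modified_extract_2 from the test code; the names full_text, repeated, greedy and lazy are introduced here for illustration only, and the doubled string merely stands in for the real data, where the extract is said to repeat.

```python
import re

# Test data copied from the post (the version without indentation).
modified_extract_2 = '''
<dc:creator>
<rdf:li>abcdef zxcvb</rdf:li>
</dc:creator>
'''

re_pattern = r'( *<dc:.*</dc:)'
match = re.search(re_pattern, modified_extract_2, re.DOTALL)

# The repr of a match object abbreviates the matched text, so the
# printed match='...' value can stop short of the real end of the match.
print(repr(match))

# group(0) returns the complete matched text, including the final ':'.
full_text = match.group(0)
print(repr(full_text))
print(match.span(), len(full_text))

# Hypothetical input: the same block twice, to mimic repeated data.
repeated = modified_extract_2 * 2

# Greedy .* runs from the first '<dc:' to the last '</dc:' ...
greedy = re.findall(r'<dc:.*</dc:', repeated, re.DOTALL)
# ... while non-greedy .*? stops at the first '</dc:' after each '<dc:'.
lazy = re.findall(r'<dc:.*?</dc:', repeated, re.DOTALL)
print(len(greedy), len(lazy))
```

If the aim is one result per repeated group, the non-greedy form with re.findall (or re.finditer) is probably the closer fit; the greedy form returns a single match that swallows every group in between.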