regex pattern to extract repeating groups
Malcolm
blakemalc66 at gmail.com
Mon Aug 27 19:58:57 EDT 2018
On 28/08/2018 7:09 AM, John Pote wrote:
> On 26/08/2018 00:55, Malcolm wrote:
>> I am trying to understand why regex is not extracting all of the
>> characters between two delimiters.
>>
>> The complete string is the xmp IFD data extracted from a .CR2 image
>> file.
>>
>> I do have a work around, but it's messy and possibly not future proof.
> Do you mean future proof your workaround or Canon's .CR2 raw image
> files might change? I guess .CR2's won't change but Canon have
> brought out the new .CR3 raw image file for which I needed to upgrade
> my photo editing suite (at least I didn't but used their tool to
> convert .CR3s from the camera to the digital negative format which
> many photo editors can handle.) Can send you sample .CR3 if you want
> to compare.
>
> Regards,
> John
John
Thank you.
Some background
The application is for personal use. While I'm familiar with Python
generally (and thanks to all who post code and answer questions), this
is the first time I have used structs to read a binary file, xml parsers
to parse some of the RDF contents and re.
First
I have now discovered that when I print the return value of re.search,
the match='...' field in the output shows only a truncated copy of the
matched characters. To see/get all of the matched characters I need to
use the span as indexes into the original string. I'm not sure if this
is mentioned in the re documentation, but all the samples I've seen on
the web use only small strings. This was the cause of my question.
for example
import re
data = '''
<dc:creator>
<rdf:Seq>
<rdf:li>abcdef zxcvb</rdf:li>
</rdf:Seq>
</dc:creator>
'''
re_pattern = r'( *<dc:.*</dc:)'
x = re.search(re_pattern, data, re.DOTALL)
print(x)
print(data[x.span()[0] : x.span()[1]])
returns
<_sre.SRE_Match object; span=(1, 89), match=' <dc:creator>\n <rdf:Seq>\n <rdf:li>abcd>
<dc:creator>
<rdf:Seq>
<rdf:li>abcdef zxcvb</rdf:li>
</rdf:Seq>
</dc:
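A minimal sketch of the same point, assuming only the standard re module: the printed match object shortens the matched text for display, while group(0) returns all of it, so slicing by the span is not required. The second half shows one possible way to pull out each repeating <dc:...> element with a non-greedy pattern and finditer. The sample string (with a second element added for illustration) and the names element_re and elements are assumptions of this sketch, not part of my actual code.

```python
import re

# Illustrative sample: the snippet above plus a second dc: element.
sample = '''
<dc:creator>
<rdf:Seq>
<rdf:li>abcdef zxcvb</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:rights>
<rdf:Alt>
<rdf:li xml:lang="x-default">abcdef zxcvb</rdf:li>
</rdf:Alt>
</dc:rights>
'''

m = re.search(r'( *<dc:.*</dc:)', sample, re.DOTALL)

# print(m) shows a shortened preview; group(0) is the whole matched text
# and equals the slice taken with the span.
full = m.group(0)
assert full == sample[m.start():m.end()]

# Repeating groups: a non-greedy body plus a backreference to the tag name
# gives one match per <dc:name>...</dc:name> element.
element_re = re.compile(r'<dc:(\w+)>(.*?)</dc:\1>', re.DOTALL)
elements = {}
for e in element_re.finditer(sample):
    # Collect the text of every rdf:li inside this element.
    items = re.findall(r'<rdf:li[^>]*>(.*?)</rdf:li>', e.group(2), re.DOTALL)
    elements[e.group(1)] = items

print(elements)
```

The greedy .* in the original pattern runs to the last </dc: in the string, which is why it returns one large match rather than one match per element.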
Second
By future proofing: At the moment I'm testing code against one .CR2
image. My wish at the moment is that my code will work on all of my .CR2
images from different cameras. When I upgrade my camera(s) to one(s)
that produce .CR3 images I will, no doubt, need to retest my code.
All I'm really trying to do is extract some metadata and a
thumbnail/preview jpg using Python instead of relying on subprocess and
exiftool/exiv2. I'm trying to speed things up. Oh, and I got sidetracked
on learning something new.
Malcolm
The full RDF-XMP as extracted (truncated):
xml_data = '''<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:Iptc4xmpCore="http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:exif="http://ns.adobe.com/exif/1.0/"
xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/"
xmlns:tiff="http://ns.adobe.com/tiff/1.0/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
Iptc4xmpCore:CountryCode="AUS"
Iptc4xmpCore:Location="Binna Burra"
exif:DateTimeDigitized="2018-07-30T09:18:24+10:00"
exif:DateTimeOriginal="2018-07-30T09:18:24+10:00"
exif:GPSAltitude="4052/5"
exif:GPSAltitudeRef="0"
exif:GPSLatitude="28,11.734230S"
exif:GPSLongitude="153,11.218140E"
exif:GPSMapDatum="WGS-84"
exif:GPSSpeed="28033/5697"
exif:GPSSpeedRef="K"
exif:GPSTimeStamp="2018-07-29T23:18:24Z"
exif:GPSVersionID="2.2.0.0"
photoshop:City="Lamington National Park"
photoshop:Country="Australia"
photoshop:DateCreated="2018-07-30T09:18:24+10:00"
photoshop:State="Qld"
tiff:Artist="Malcolm Blake"
xmp:ModifyDate="2018-07-30T09:18:24+10:00">
<dc:creator>
<rdf:Seq>
<rdf:li>Malcolm Blake</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:rights>
<rdf:Alt>
<rdf:li xml:lang="x-default">Malcolm Blake</rdf:li>
</rdf:Alt>
</dc:rights>
<dc:subject>
<rdf:Bag>
<rdf:li>AUS, Arthur Grooms Cottage, Australia, Binna Burra,
Lamington National Park, Qld</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>'''
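Since the packet is well-formed XML, an XML parser may be a sturdier alternative to regex here. The following is only a sketch, assuming the standard library's xml.etree.ElementTree and an abbreviated copy of the packet above; the names xmp, NS, city and fields are made up for the example.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map, using the namespace URIs declared in the packet itself.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "photoshop": "http://ns.adobe.com/photoshop/1.0/",
}

# Abbreviated copy of the packet, enough to show both attributes and elements.
xmp = '''<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/"
photoshop:City="Lamington National Park">
<dc:creator>
<rdf:Seq>
<rdf:li>Malcolm Blake</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:subject>
<rdf:Bag>
<rdf:li>AUS, Binna Burra</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>'''

root = ET.fromstring(xmp)
desc = root.find("rdf:Description", NS)

# Namespaced attributes are keyed as "{namespace-uri}localname".
city = desc.get("{%s}City" % NS["photoshop"])

# For each dc:* child, collect the text of every rdf:li beneath it.
dc_prefix = "{%s}" % NS["dc"]
fields = {}
for child in desc:
    if child.tag.startswith(dc_prefix):
        name = child.tag[len(dc_prefix):]
        fields[name] = [li.text for li in child.iterfind(".//rdf:li", NS)]

print(city)
print(fields)
```

This picks up every dc: element without the pattern needing to know where one element ends and the next begins.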