[Tutor] Regex help
billburns at pennswoods.net
Mon Oct 10 13:02:00 CEST 2005
> If the format is consistent enough, you might get away with something like:
> >>> p = re.compile(r'MediaBox \[ ?\d+ \d+ (\d+) (\d+) ?\]')
> >>> print p.search(s).groups()
> ('612', '792')
> The important bits being: ? means "0 or 1 occurrences", and you can use
> parentheses to group matches, and they get put into the tuple returned
> by the .groups() function. So you can match and extract what you want in
> one go.
> http://www.amk.ca/python/howto/regex/ is a fairly gentle introduction to
> regular expressions in Python if you want to learn more.
> Having said all that, usually you would use a library of some sort to
> access header information, although I'm not sure what Python has for PDF
> support, and if that's -all- the information you need, and the -only-
> variation you'll see, regex probably won't be too bad :)
Yes, the format is consistent (I believe the whitespace I mentioned is
the only difference you may find).
I'll take a look at your use of group matches tonight; it looks like a
really easy way to return the two numbers I need.
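
For reference, here is a minimal sketch of the group-match approach in
Python 3 syntax. The sample string and the names SAMPLE, MEDIABOX, and
page_size are made up for illustration, and the pattern assumes the
MediaBox always holds four whole numbers, with the optional space just
inside each bracket as the only variation.

```python
import re

# Hypothetical sample line; a real PDF may lay out its MediaBox differently.
SAMPLE = "/MediaBox [ 0 0 612 792 ]"

# Raw string so the backslashes reach the regex engine unchanged.
# " ?" allows zero or one space just inside each bracket.
MEDIABOX = re.compile(r'MediaBox \[ ?\d+ \d+ (\d+) (\d+) ?\]')


def page_size(text):
    """Return (width, height) as ints, or None if no MediaBox is found."""
    match = MEDIABOX.search(text)
    if match is None:
        return None
    # groups() returns the two parenthesized captures as strings.
    width, height = match.groups()
    return int(width), int(height)


print(page_size(SAMPLE))
```

Checking for None before calling .groups() avoids an AttributeError on
input that has no MediaBox entry in the expected form.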
Yeah, I was hoping to find a Python PDF library that could do this, but
things seem a little sparse in this area. The only info I need is the
PDF size and it's consistently located (and tagged) in the MediaBox so I
figured it was a good way to get the data.