[Tutor] Regex help

Bill Burns billburns at pennswoods.net
Mon Oct 10 13:02:00 CEST 2005


[Andrew]
> If the format is consistent enough,  you might get away with something like:
> 
>  >>> p = re.compile(r'MediaBox \[ ?\d+ \d+ (\d+) (\d+) ?\]')
>  >>> print(p.search(s).groups())
> ('612', '792')
> 
> The important bits being:  ? means "0 or 1 occurrences", and you can use 
> parentheses to group matches, and they get put into the tuple returned 
> by the .groups() method. So you can match and extract what you want in 
> one go.
> 
> http://www.amk.ca/python/howto/regex/ is a fairly gentle introduction to 
> regular expressions in Python if you want to learn more.
> 
> Having said all that, usually you would use a library of some sort to 
> access header information, although I'm not sure what Python has for PDF 
> support, and if that's -all- the information you need, and the -only- 
> variation you'll see, regex probably won't be too bad :)
> 

Thanks, Andrew!

Yes, the format is consistent (I believe the whitespace I mentioned is
the only difference you may find).

I'll take a look at your use of group matches tonight. It looks like a
really easy way to return the two numbers I need.

Yeah, I was hoping to find a Python PDF library that could do this, but
things seem a little sparse in this area. The only info I need is the
PDF size and it's consistently located (and tagged) in the MediaBox so I
figured it was a good way to get the data.
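
A minimal sketch of that approach, reusing the pattern quoted above as a
raw string. The function name page_size and the sample strings are
hypothetical, and it assumes the MediaBox values are whole numbers and
that the PDF text has already been read into a str:

```python
import re

# Same pattern as in the quoted reply, written as a raw string.
# Assumption: all four MediaBox values are plain integers.
MEDIABOX = re.compile(r'MediaBox \[ ?\d+ \d+ (\d+) (\d+) ?\]')

def page_size(text):
    """Return (width, height) as ints, or None if no MediaBox matches."""
    match = MEDIABOX.search(text)
    if match is None:
        return None
    width, height = match.groups()
    return int(width), int(height)

# Hypothetical sample strings, with and without the optional whitespace.
print(page_size('/MediaBox [0 0 612 792]'))    # (612, 792)
print(page_size('/MediaBox [ 0 0 612 792 ]'))  # (612, 792)
print(page_size('no box here'))                # None
```

Returning None when nothing matches avoids the AttributeError that
calling .groups() directly on a failed search would raise.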

Bill

