[Tutor] Regex help

Andrew P grouch at gmail.com
Mon Oct 10 08:26:33 CEST 2005


If the format is consistent enough, you might get away with something like:

>>> import re
>>> s = 'MediaBox [ 0 0 612 792 ]'
>>> p = re.compile(r'MediaBox \[ ?\d+ \d+ (\d+) (\d+) ?\]')
>>> print(p.search(s).groups())
('612', '792')

The important bits: ? means "0 or 1 occurrences" of the item just before it
(here, a space), and you can use parentheses to group parts of the match,
which get put into the tuple returned by the match object's .groups() method.
So you can match and extract what you want in one go.
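
To check that one pattern really covers both of the examples in the question,
here is a small sketch. The helper name page_size is made up for illustration,
and it assumes the box starts at 0 0 (so the last two numbers are the width
and height) and that all four values are plain integers, which may not hold
for every PDF.

```python
import re

# " ?" allows zero or one space after "[" and before "]".
# The two parenthesised groups capture the third and fourth numbers.
MEDIABOX = re.compile(r'MediaBox \[ ?\d+ \d+ (\d+) (\d+) ?\]')

def page_size(text):
    """Return (width, height) as ints, or None if no MediaBox is found.

    Hypothetical helper: assumes the lower-left corner is 0 0.
    """
    m = MEDIABOX.search(text)
    if m is None:
        return None
    width, height = m.groups()
    return int(width), int(height)

print(page_size('MediaBox [0 0 612 792]'))    # Example #1
print(page_size('MediaBox [ 0 0 612 792 ]'))  # Example #2
print(page_size('no box in this string'))     # no match
```

Checking for None before calling .groups() matters here, because .search()
returns None when nothing matches and calling .groups() on that raises an
AttributeError.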

http://www.amk.ca/python/howto/regex/ is a fairly gentle introduction to
regular expressions in Python if you want to learn more.

Having said all that, you would usually use a library of some sort to read
header information, although I'm not sure what Python has for PDF support.
If that's -all- the information you need, and the -only- variation you'll
see, a regex probably won't be too bad :)


On 10/9/05, Bill Burns <billburns at pennswoods.net> wrote:
>
> I'm looking to get the size (width, length) of a PDF file. Every pdf
> file has a 'tag' (in the file) that looks similar to this
>
> Example #1
> MediaBox [0 0 612 792]
>
> or this
>
> Example #2
> MediaBox [ 0 0 612 792 ]
>
> I figured a regex might be a good way to get this data but the
> whitespace (or no whitespace) after the left bracket has me stumped.
>
> If I do this
>
> pattern = re.compile('MediaBox \[\d+ \d+ \d+ \d+')
>
> I can find the MediaBox in Example #1 but I have to do this
>
> pattern = re.compile('MediaBox \[ \d+ \d+ \d+ \d+')
>
> to find it for Example #2.
>
> How can I make *one* regex that will match both cases?
>
> Thanks for the help,
>
> Bill