[Tutor] Regex help

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Mon Oct 10 23:46:21 CEST 2005



On Mon, 10 Oct 2005, Bill Burns wrote:

> I'm looking to get the size (width, length) of a PDF file.


Hi Bill,

Just as a side note: you may want to look into using the 'pdfinfo' utility
that comes as part of the xpdf package:

    http://www.foolabs.com/xpdf/

For example:

#######################################################################
[dyoo at shoebox ~]$ pdfinfo 05-lexparse.pdf
Producer:       Acrobat Distiller Command 3.0 for Solaris 2.3 and later (SPARC)
CreationDate:   Tue Jul  1 18:36:35 1913
Tagged:         no
Pages:          12
Encrypted:      no
Page size:      612 x 792 pts (letter)
File size:      191874 bytes
Optimized:      no
PDF version:    1.2
#######################################################################
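
If you go the pdfinfo route, the page size can be picked out of that output
with a small pattern.  This is only a sketch: 'parse_page_size' is a made-up
name, and it assumes the "Page size:" line always looks like the one above,
which may not hold for every version of pdfinfo.

```python
import re

# Assumed line format, taken from the sample output above:
#   Page size:      612 x 792 pts (letter)
PAGE_SIZE_RE = re.compile(r"Page size:\s+([\d.]+)\s+x\s+([\d.]+)\s+pts")

def parse_page_size(pdfinfo_output):
    """Return (width, height) in points as floats, or None if not found."""
    match = PAGE_SIZE_RE.search(pdfinfo_output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Sample text copied from the pdfinfo run shown above.
sample = "Pages:          12\nPage size:      612 x 792 pts (letter)\n"
print(parse_page_size(sample))
```

The text passed in would be whatever pdfinfo printed for your file; how you
capture it (a pipe, a saved file, and so on) is up to you.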



> Every pdf file has a 'tag' (in the file) that looks similar to this
>
> Example #1
> MediaBox [0 0 612 792]
>
> or this
>
> Example #2
> MediaBox [ 0 0 612 792 ]
>
> I figured a regex might be a good way to get this data but the
> whitespace (or no whitespace) after the left bracket has me stumped.


You probably want the whitespace metacharacter '\s', which matches a single
whitespace character.  You can also use '*' to quantify the preceding
pattern: it stands for "zero or more of the pattern", so '\s*' matches
optional whitespace.  For example:

#####################################
>>> import re
>>> re.search("a*b", "aab")
<_sre.SRE_Match object at 0x403ae250>
>>> re.search("a*b", "ab")
<_sre.SRE_Match object at 0x403ae138>
>>> re.search("a*b", "b")
<_sre.SRE_Match object at 0x403ae250>
>>> re.search("a*b", "")
>>>
#####################################

In comparison, '+' stands for "one or more of the pattern":


#####################################
>>> re.search("a+b", "aab")
<_sre.SRE_Match object at 0x403ae138>
>>> re.search("a+b", "ab")
<_sre.SRE_Match object at 0x403ae250>
>>> re.search("a+b", "b")
>>>
#####################################
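
Applied to the MediaBox lines from your message, '\s*' takes care of the
optional space after the left bracket, and '\s+' covers the gaps between the
numbers.  Here is a rough sketch, not a full PDF parser: 'find_mediabox' is a
name I made up, and the pattern assumes the four values are plain integers
as in your two examples.

```python
import re

# Hypothetical pattern: \s* allows zero or more whitespace characters
# around the brackets, \s+ requires at least one between the numbers.
MEDIABOX_RE = re.compile(
    r"MediaBox\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]"
)

def find_mediabox(text):
    """Return the four MediaBox values as a tuple of ints, or None."""
    match = MEDIABOX_RE.search(text)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())

# The two forms quoted in the original question.
print(find_mediabox("MediaBox [0 0 612 792]"))
print(find_mediabox("MediaBox [ 0 0 612 792 ]"))
```

Both calls should give the same four numbers; the width and length are the
last two.  A file whose MediaBox holds non-integer values would need a looser
number pattern than '\d+'.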


Good luck to you!


