Checking that 2 pdf are identical (md5 a solution?)
Peter Otten
__peter__ at web.de
Sat Jul 24 11:50:31 EDT 2010
rlevesque wrote:
> Hi
>
> I am working on a program that generates various pdf files in the /
> results folder.
>
> "scenario1.pdf" results from scenario1
> "scenario2.pdf" results from scenario2
> etc
>
> Once I am happy with scenario1.pdf and scenario2.pdf files, I would
> like to save them in the /check folder.
>
> Now after having developed/modified the program to produce
> scenario3.pdf, I would like to be able to re-generate
> files
> /results/scenario1.pdf
> /results/scenario2.pdf
>
> and compare them with
> /check/scenario1.pdf
> /check/scenario2.pdf
>
> I tried using the md5 module to compare these files but md5 reports
> differences even though the code has *not* changed at all.
>
> Is there a way to compare 2 pdf files generated at different time but
> identical in every other respect and validate by program that the
> files are identical (for all practical purposes)?
Here's a naive approach, but it may be good enough for your purpose.
I've printed the same small text into 1.pdf and 2.pdf.
(Bad practice warning: this session is slightly doctored; I hope I haven't
introduced an error.)
>>> a = open("1.pdf").read()
>>> b = open("2.pdf").read()
>>> diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
>>> len(diff)
2
>>> diff
[160, 161]
>>> a[150:170]
'0100724151412)\n>>\nen'
>>> a[140:170]
'nDate (D:20100724151412)\n>>\nen'
>>> a[130:170]
')\n/CreationDate (D:20100724151412)\n>>\nen'
OK, let's ignore "lines" starting with "/CreationDate " for our custom
comparison function:
>>> def equal_pdf(fa, fb):
...     with open(fa) as a:
...         with open(fb) as b:
...             for la, lb in izip_longest(a, b, fillvalue=""):
...                 if la != lb:
...                     if not la.startswith("/CreationDate "): return False
...                     if not lb.startswith("/CreationDate "): return False
...     return True
...
>>> from itertools import izip_longest
>>> equal_pdf("1.pdf", "2.pdf")
True
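
For readers on Python 3, here is a minimal sketch of the same idea. It
assumes, as in the session above, that the only bytes that differ sit on
lines starting with "/CreationDate "; other PDF producers may embed further
timestamps or IDs that this does not cover. The files are opened in binary
mode because PDFs are not text, and izip_longest is called zip_longest in
Python 3. The helper pdf_digest is a hypothetical addition, not part of the
original session: it feeds every line except the skipped ones into md5, so
that a stored checksum can still be compared.

```python
import hashlib
from itertools import zip_longest

# Assumption: lines with this prefix are the only ones allowed to differ.
SKIP_PREFIX = b"/CreationDate "


def equal_pdf(fa, fb):
    """Return True if the files differ only on lines starting with SKIP_PREFIX."""
    with open(fa, "rb") as a, open(fb, "rb") as b:
        # zip_longest pads the shorter file with b"", so a length
        # mismatch shows up as an ordinary line difference.
        for la, lb in zip_longest(a, b, fillvalue=b""):
            if la != lb:
                if not la.startswith(SKIP_PREFIX):
                    return False
                if not lb.startswith(SKIP_PREFIX):
                    return False
    return True


def pdf_digest(path):
    """Hypothetical helper: md5 hex digest of the file, skipping SKIP_PREFIX lines."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for line in f:
            if not line.startswith(SKIP_PREFIX):
                h.update(line)
    return h.hexdigest()
```

Note that the two functions are not strictly equivalent: equal_pdf requires
the "/CreationDate " lines to appear at the same position in both files,
while pdf_digest drops them wherever they occur.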
Peter