Checking that 2 pdf are identical (md5 a solution?)

Peter Otten __peter__ at web.de
Sat Jul 24 11:50:31 EDT 2010


rlevesque wrote:

> Hi
> 
> I am working on a program that generates various pdf files in the
> /results folder.
> 
> "scenario1.pdf"  results from scenario1
> "scenario2.pdf" results from scenario2
> etc
> 
> Once I am happy with scenario1.pdf and scenario2.pdf files, I would
> like to save them in the /check folder.
> 
> Now after having developed/modified the program to produce
> scenario3.pdf, I would like to be able to re-generate files
> /results/scenario1.pdf
> /results/scenario2.pdf
> 
> and compare them with
> /check/scenario1.pdf
> /check/scenario2.pdf
> 
> I tried using the md5 module to compare these files but md5 reports
> differences even though the code has *not* changed at all.
> 
> Is there a way to compare 2 pdf files generated at different time but
> identical in every other respect and validate by program that the
> files are identical (for all practical purposes)?

Here's a naive approach, but it may be good enough for your purpose.
I've printed the same short text to 1.pdf and 2.pdf.

(Bad practice warning: this session is slightly doctored; I hope I haven't
introduced an error.)

>>> a = open("1.pdf", "rb").read()
>>> b = open("2.pdf", "rb").read()
>>> diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
>>> len(diff)
2
>>> diff
[160, 161]
>>> a[150:170]
'0100724151412)\n>>\nen'
>>> a[140:170]
'nDate (D:20100724151412)\n>>\nen'
>>> a[130:170]
')\n/CreationDate (D:20100724151412)\n>>\nen'
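
The inspection in the session above can be wrapped in a small helper. What
follows is only a sketch in Python 3 (the session itself is Python 2); the
names diff_offsets and first_diff_context and the default width of 30 bytes
are made up for illustration.

```python
from itertools import zip_longest


def diff_offsets(a: bytes, b: bytes) -> list:
    """Return every offset at which a and b differ.

    zip_longest pads the shorter input with None, so a difference in
    length also shows up as differing offsets at the end.
    """
    return [i for i, (x, y) in enumerate(zip_longest(a, b)) if x != y]


def first_diff_context(a: bytes, b: bytes, width: int = 30):
    """Return (offset, slice of a, slice of b) around the first difference.

    Returns None when the two inputs are identical.
    """
    offsets = diff_offsets(a, b)
    if not offsets:
        return None
    first = offsets[0]
    lo = max(first - width, 0)
    hi = first + width
    return first, a[lo:hi], b[lo:hi]
```

To use it on real files, read both in binary mode first, for example
open("1.pdf", "rb").read(), and pass the two byte strings in.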

So the only differing bytes are two digits inside the /CreationDate
timestamp. Let's ignore "lines" starting with "/CreationDate " in a custom
comparison function:

>>> from itertools import izip_longest
>>> def equal_pdf(fa, fb):
...     with open(fa, "rb") as a:
...         with open(fb, "rb") as b:
...             for la, lb in izip_longest(a, b, fillvalue=""):
...                 if la != lb:
...                     # A mismatch is tolerated only if both lines are
...                     # /CreationDate entries.
...                     if not la.startswith("/CreationDate "):
...                         return False
...                     if not lb.startswith("/CreationDate "):
...                         return False
...             return True
...
>>> equal_pdf("1.pdf", "2.pdf")
True
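
For Python 3, where izip_longest is called zip_longest, the same idea might
look like the sketch below. It also shows one way to get back to an md5
check, as asked in the original question: hash only the lines that are not
volatile. The names VOLATILE_PREFIXES, equal_lines and stable_md5 are
hypothetical, and only the /CreationDate prefix comes from the session
above; whether other entries vary between runs depends on your PDF
generator.

```python
import hashlib
from itertools import zip_longest

# Line prefixes whose content may legitimately differ between runs.
# Only /CreationDate was observed in the session above; extend the
# tuple if your PDF generator writes other volatile entries.
VOLATILE_PREFIXES = (b"/CreationDate ",)


def equal_lines(lines_a, lines_b, prefixes=VOLATILE_PREFIXES):
    """Compare two sequences of byte lines, tolerating volatile lines.

    A mismatch is accepted only when both lines start with a volatile
    prefix. zip_longest pads the shorter input with b"", which never
    matches a prefix, so extra lines count as a difference.
    """
    for la, lb in zip_longest(lines_a, lines_b, fillvalue=b""):
        if la != lb and not (la.startswith(prefixes) and lb.startswith(prefixes)):
            return False
    return True


def equal_pdf(path_a, path_b, prefixes=VOLATILE_PREFIXES):
    """Open both files in binary mode and compare them line by line."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        return equal_lines(a, b, prefixes)


def stable_md5(lines, prefixes=VOLATILE_PREFIXES):
    """Return the md5 hex digest of all lines that are not volatile."""
    digest = hashlib.md5()
    for line in lines:
        if not line.startswith(prefixes):
            digest.update(line)
    return digest.hexdigest()
```

stable_md5 accepts any iterable of byte lines, so an open binary file can be
passed directly. Note that it is slightly looser than equal_lines: it drops
volatile lines wherever they occur, whereas equal_lines only tolerates them
when they appear at the same position in both files.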

Peter


