reading PDF using Python [Q]
Nick Moon
ncmoon at cix.compulink.co.uk
Tue May 11 05:38:31 EDT 1999
> > I have been playing with parsing pdf files in python. The format
> > of .pdf is documented on Adobe's web site.
>
> Any usefull URL?
Try the Adobe site, www.adobe.com, but you knew that. The document you want
is called 'Portable Document Format Reference Manual - Version 1.2',
though I think Acrobat v4 means there is now a version 1.3. It's in,
surprisingly, .pdf format, and it's big: about 400 pages when printed.
It is pretty unreadable, but it does describe the file format in
mind-numbingly boring detail. The pdf format itself looks like the work of
several different people over several different years. Different bits of
the format seem to use rather different styles of data structures.
> Do you know more about PDF encryption and compression?
PDF files have a general structure, something like: a header, a list of
objects, a lookup table, and an end section. The lookup table (the manual
calls it the cross-reference table) is a list of byte offsets, one for
each object. It allows a program to open the file from the end and then
jump directly to each object as required. Updates can be appended to a
file without changing any of its existing contents. An update consists of
some objects plus a new lookup table and end section.
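
To make the "open the file from the end" idea concrete, here is a rough
sketch in Python. It relies on a detail from the reference manual rather
than anything I spelled out above: the end section finishes with the
keyword startxref, the byte offset of the most recent lookup table, and a
%%EOF marker. The function name find_xref_offset and the tiny sample file
are made up for illustration, and a real reader would need to be much more
tolerant of odd files.

```python
import re


def find_xref_offset(data: bytes) -> int:
    """Return the byte offset of the most recent lookup table.

    Hypothetical helper: it only inspects the last 1024 bytes and expects
    the 'startxref <offset> %%EOF' layout described in the PDF manual.
    """
    tail = data[-1024:]
    match = None
    # Appended updates add further end sections, so keep the last match.
    for match in re.finditer(rb"startxref\s+(\d+)\s+%%EOF", tail):
        pass
    if match is None:
        raise ValueError("no startxref/%%EOF marker found near end of data")
    return int(match.group(1))


# A made-up, minimal file: header, one object, lookup table, end section.
body = b"%PDF-1.2\n1 0 obj\n<< >>\nendobj\n"
xref_pos = len(body)  # the lookup table starts right after the object
sample = (
    body
    + b"xref\n0 1\n0000000000 65535 f \n"
    + b"trailer\n<< /Size 1 >>\n"
    + b"startxref\n" + str(xref_pos).encode("ascii") + b"\n%%EOF\n"
)

offset = find_xref_offset(sample)
```

With the offset in hand, a program can seek straight to the lookup table
and from there to whichever object it needs, without reading the rest of
the file.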
The actual page descriptions, which are probably what you want to look
at, are stored in a stream, and the stream is in turn inside an object.
Streams may be written/read using various filters. A typical filter set
would be:
ASCII85Decode / LZWDecode
which means the data has been compressed using LZW and then the binary
output of LZW has been turned into ASCII (base 85). The filters are
listed in the order a reader applies them, so to get the data back you
undo the ASCII85 encoding first and then decompress with LZW.
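
As a small illustration of the first of those two filters, here is a
sketch in Python using the standard library's base64.a85decode. The
function name ascii85_layer and the sample content are made up, and the
sketch assumes the stream data ends with the usual ~> end-of-data marker.
It only undoes the ASCII85 step: the standard library has no LZW decoder,
so the result would still have to go through a separate LZWDecode
implementation.

```python
import base64


def ascii85_layer(stream_data: bytes) -> bytes:
    """Undo the ASCII85Decode filter on raw stream data.

    Hypothetical helper: strips the trailing '~>' marker if present and
    hands the rest to the standard library decoder.
    """
    data = stream_data.strip()
    if data.endswith(b"~>"):
        data = data[:-2]
    # a85decode skips embedded whitespace such as line breaks by default.
    return base64.a85decode(data)


# Made-up example: encode some page-description-like bytes, then decode.
original = b"BT /F1 12 Tf (Hello) Tj ET"
encoded = base64.a85encode(original) + b"~>"
decoded = ascii85_layer(encoded)
```

For a stream that lists both filters, the bytes that come out of this
step are the LZW-compressed data, not yet the page description itself.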
Cheers,
Nick.