[Tutor] Reading big files

alan.gauld@bt.com
Wed, 24 Nov 1999 12:00:04 -0000


> I have a PDF file I want to read. I used the following code.
> 
> f=open('D:\\Documents\\MyFile')

You'll need to open it in binary mode:

f=open('D:\\Documents\\MyFile', 'rb')

I also assume your file name actually ends in .pdf, so you'll need
to include the extension too:

f=open('D:\\Documents\\MyFile.PDF', 'rb')

Make sure Windows Explorer's view settings are set to show the full
filename; by default it hides known extensions.
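
Here is a minimal sketch of the binary read in current Python. The
helper name read_all_bytes and the stand-in file contents are my own
inventions for illustration, not anything from your script:

```python
import os
import tempfile

def read_all_bytes(path):
    # 'rb' = read, binary: the bytes come back exactly as stored on disk,
    # with no newline translation and no special end-of-file handling.
    with open(path, 'rb') as f:
        return f.read()

# Stand-in data (made up) containing bytes that tend to upset text mode.
payload = b'%PDF-1.3\n\x1a\x00\xff rest of file\n'

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'MyFile.PDF')
    with open(demo_path, 'wb') as f:
        f.write(payload)
    data = read_all_bytes(demo_path)

print(len(data), 'bytes read')
```

If the length reported matches the file size Explorer shows, you are
getting the whole file.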

Then you'll need to find a reference for the internal format of 
the PDF file and decode the binary bytes that you read.
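
As a first small step in decoding, a PDF normally starts with a
header line of the form %PDF-x.y. The sketch below pulls the version
out of that header. The function name pdf_version and the sample
bytes are hypothetical, and this is nowhere near a PDF parser: the
text itself usually sits in streams that are often compressed.

```python
import re

# Header marker followed by major.minor version, e.g. b'%PDF-1.3'
HEADER_RE = re.compile(rb'%PDF-(\d+)\.(\d+)')

def pdf_version(data):
    """Return (major, minor) from the header, or None if there is no header."""
    match = HEADER_RE.match(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Made-up sample: a header line followed by a binary comment line.
print(pdf_version(b'%PDF-1.3\n%\xe2\xe3\xcf\xd3\n'))
```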

> Where am I going wrong? Is the file too big? Does it have a special
> character in it that prevents reading the entire content?

It probably contains a byte sequence that looks like end-of-file to
Python when the file is opened in text mode. On Windows a Ctrl-Z
byte (0x1A) is the usual suspect.
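
One way to test that guess is to search the raw bytes for a Ctrl-Z
(0x1A), the byte that text-mode reads on Windows have traditionally
treated as end-of-file. This is only a sketch: the helper name and
sample bytes are made up, and 0x1A being the culprit is an assumption.

```python
CTRL_Z = b'\x1a'

def first_ctrl_z(data):
    """Return the offset of the first 0x1A byte, or -1 if there is none."""
    return data.find(CTRL_Z)

# Made-up sample with a Ctrl-Z buried in the middle.
sample = b'%PDF-1.3\nstream\x1amore bytes'
offset = first_ctrl_z(sample)
print('first Ctrl-Z at offset', offset)
```

If your original text-mode read stopped at about the same offset,
that would be consistent with the EOF theory.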

One thing to try is opening the file with debug from a DOS prompt:

C:> debug foo.pdf

Use 'd' at the prompt to dump a listing of the file in hex.
There is also an ASCII listing on the right. Compare the 
characters on the right with the hex patterns on the left 
- that may help you decode the PDF format sufficiently 
well to extract the text you want....
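
If you would rather stay in Python, a similar listing is easy to
produce yourself. The sketch below is my own approximation of that
kind of hex-plus-ASCII dump, not a copy of debug's exact layout; the
name hex_dump and the 16-byte row width are arbitrary choices.

```python
def hex_dump(data, width=16):
    """Return lines of 'offset  hex bytes  printable ASCII' for data."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = ' '.join('%02X' % b for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        ascii_part = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
        # Pad the hex column so the ASCII column lines up on short rows.
        lines.append('%08X  %-*s  %s' % (offset, width * 3 - 1, hex_part, ascii_part))
    return lines

# Made-up sample bytes; in practice pass in what you read with 'rb'.
for line in hex_dump(b'%PDF-1.3\n\x1a\x00\xff rest of file\n'):
    print(line)
```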


Alan G.