reading text in pdf, some working sample code

Daniel Gross grossd18 at gmail.com
Tue Nov 21 10:18:43 EST 2017


Hi,

I am new to Python and jumped right into trying to extract (English) text
from PDF files.

I tried various libraries (including slate), but I keep running into
different problems, such as encoding errors or "buffer too small" errors
deep inside some decompression code.

Essentially, I want to extract all the text and then do some natural
language processing on it. Is there working sample code available,
together with a clear description of the Python installation environment
it needs?
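
A minimal sketch of that pipeline, for concreteness. It assumes the
third-party pdfminer.six package (installed with "pip install pdfminer.six")
and its pdfminer.high_level.extract_text function. The helper names
extract_pdf_text, tokenize and top_words are made up for illustration, and
the regex tokenizer is only a stand-in for a real NLP library.

```python
import re
from collections import Counter


def extract_pdf_text(path):
    """Return all text in the PDF at `path` as one string.

    Assumes pdfminer.six is installed. The import is done lazily so the
    helpers below can be used without it.
    """
    from pdfminer.high_level import extract_text  # third-party (assumption)
    return extract_text(path)


def tokenize(text):
    """Lowercase the text and split it into rough word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def top_words(text, n=10):
    """Return the n most common tokens with their counts."""
    return Counter(tokenize(text)).most_common(n)


# Usage, with a hypothetical file name:
#   text = extract_pdf_text("paper.pdf")
#   print(top_words(text))
```

The extraction step and the language-processing step are kept separate so
that either one can be swapped for a different library.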

In slate, by the way, I got the buffer error, and it seems I must "guess"
the right encoding of the text included in the PDF when opening the file.
I am still trying to figure out how to get the encoding info out of the
PDF (if it is available there).
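
One thing that may be worth checking on the encoding question: a PDF is a
binary file, so it is normally opened in binary mode ("rb"), where no text
encoding is involved at all. Being asked for an encoding can be a sign that
the file was opened in text mode, though whether that is the cause here is
only an assumption. The sketch below uses just the standard library, and
looks_like_pdf is a hypothetical helper name. It checks for the "%PDF-"
marker that a PDF file normally starts with.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` begins with the PDF marker."""
    # Binary mode: bytes are read as-is, no encoding needs to be guessed.
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

If this returns True, the same binary file object (or the path) is what a
PDF library would usually be given.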

Thank you,

Daniel



More information about the Python-list mailing list