Script to extract text from PDF files

Lawrence D'Oliveiro ldo at geek-central.gen.new_zealand
Wed Sep 26 04:19:00 CEST 2007


In message <1190747931.415834.75670 at n39g2000hsh.googlegroups.com>, 
byte8bits at gmail.com wrote:

> On Sep 25, 3:02 pm, Paul Hankin <paul.han... at gmail.com> wrote:
>
>> Googling for 'pdf to text python' and following the first link
>> gives http://pybrary.net/pyPdf/
> 
> Doesn't work that well...

This is inherent in the nature of PDF: it's a page-description language, not
a document-interchange language. Each text-drawing command can put a block
of text anywhere on the page, so you have no idea, just from parsing the
PDF content, how to join these blocks up into lines, paragraphs, columns
etc.
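
To make that concrete, here is a minimal sketch of the kind of guesswork an
extractor is forced into. It assumes the text-drawing commands have already
been parsed into hypothetical (x, y, text) tuples; the function name
group_into_lines and the y_tolerance parameter are made up for illustration
and are not part of pyPdf or any other library.

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Guess at lines from positioned text fragments.

    fragments: iterable of (x, y, text) tuples (a hypothetical format).
    Fragments whose y differs from the first fragment of the current
    line by at most y_tolerance are treated as the same line.
    """
    lines = []  # each entry is [reference_y, [(x, text), ...]]
    # The default PDF coordinate system has y increasing upwards, so
    # sorting by descending y gives top-to-bottom reading order.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each guessed line, order the fragments left to right.
    return [
        " ".join(text for _, text in sorted(parts, key=lambda p: p[0]))
        for _, parts in lines
    ]


# Fragments in arbitrary drawing order, as a PDF is free to emit them.
fragments = [
    (72, 700, "Hello"),
    (300, 650.5, "column"),
    (110, 700.5, "world"),
    (72, 650, "Left"),
]
print(group_into_lines(fragments))
```

Note that "Left" and "column" come out joined into one line purely because
they share a baseline. If they actually belonged to two separate columns,
nothing in the positions alone would tell you so, which is exactly the
problem.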


