Puzzling PDF

Alister alister.ware at ntlworld.com
Sun Feb 16 19:59:37 CET 2014


On Sun, 16 Feb 2014 10:33:39 -0500, Roy Smith wrote:

> In article <mailman.7056.1392559276.18130.python-list at python.org>,
>  "F.R." <anthra.norell at bluewin.ch> wrote:
> 
>> Hi all,
>> 
>> Struggling to parse bank statements unavailable in sensible
>> data-transfer formats, I use pdftotext, which solves part of the
>> problem. The other day I encountered a strange thing: one single
>> figure out of many was erroneously converted into letters. Adobe Reader
>> displays the figure 50'000 correctly, but pdftotext makes it into
>> "SO'OOO" (The letters "S" as in Susan and "O" as in Otto). One would
>> expect such a mistake from an OCR. However, the statement is not a
>> scan, but is made up of text. Because malfunctions like this put a
>> damper on
>> the hope to ever have a reliable reader that doesn't require
>> time-consuming manual verification, I played around a bit and ended up
>> even more confused: When I lift the figure off the Adobe display (mark,
>> copy) and paste it into a Python IDLE window, it is again letters
>> (ascii 83 and 79), when on the Adobe display it shows correctly as
>> digits. How can that be?
>> 
>> Frederic
> 
> Maybe it's an intentional effort to keep people from screen-scraping
> data out of the PDFs (or perhaps trace when they do).  Is it possible
> the document includes a font where those codepoints are drawn exactly
> the same as the digits they resemble?

This seems to be the most likely explanation to me, although I would like
to know why.
Assuming these are your bank statements, I would change banks.

Mine are available in a variety of formats (QIF & CSV) so that they can
be used in my own accounting programs if I desire.

I see no reason why the bank would want to prevent me from accessing this data.
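
If changing banks is not an option, and the font really does draw "S"
and "O" as "5" and "0", a stopgap along the following lines might
repair the pdftotext output. This is only a rough sketch under stated
assumptions: the names LOOKALIKES, AMOUNT_SHAPE and
fix_lookalike_digits are made up for illustration, the mapping covers
only the two letters reported in this thread, and the "looks like an
amount" test is a crude heuristic (for example, it leaves a figure
followed directly by a comma untouched), so converted figures would
still deserve a manual check.

```python
import re

# Assumed mapping: only the two look-alike letters reported in the thread.
LOOKALIKES = str.maketrans("SO", "50")

# Groups of digits or suspect letters, joined by ' . or , separators.
AMOUNT_SHAPE = re.compile(r"[0-9SO]+(?:['.,][0-9SO]+)*")


def fix_token(match):
    """Translate a token only if it is shaped like a formatted number."""
    token = match.group(0)
    # Require a real digit or a separator, so that a plain word such as
    # "SO" is left alone.
    looks_numeric = (AMOUNT_SHAPE.fullmatch(token)
                     and re.search(r"[0-9'.,]", token))
    return token.translate(LOOKALIKES) if looks_numeric else token


def fix_lookalike_digits(text):
    """Apply fix_token to every whitespace-delimited token in text."""
    return re.sub(r"\S+", fix_token, text)


print(fix_lookalike_digits("Balance SO'OOO CHF, SO far"))
# prints: Balance 50'000 CHF, SO far
```

Working token by token keeps ordinary words intact, at the price of
missing figures that are glued to other punctuation.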



-- 
Without life, Biology itself would be impossible.


