unicode .replace not working - why?

Kurt Peters nospampeterskurt at msn.com
Sun Oct 12 21:56:14 EDT 2008


Thanks,
  clearly though, my "for loop" shows a character whose ord() is 167, and 
print repr(textu) shows the character as \xa7 (as does Peter Otten's post). 
So you can see what I see, here's the document I'm using - the Special Use 
Airspace document at
http://www.faa.gov/airports_airtraffic/air_traffic/publications/
listed there as JO 7400.8P (PDF).

If you just look at page three, it shows those unusual characters.
Once again, a "simple" replace doesn't seem to work.  I can't 
figure out how to get it to work, despite all the great posts attempting to 
shed some light on the subject.

Regards,
Kurt


"John Machin" <sjmachin at lexicon.net> wrote in message 
news:42f39e4c-e49a-49a3-8a2c-1adbcbb81d88 at u40g2000pru.googlegroups.com...
On Oct 12, 7:05 am, Kurt Peters <nospampete... at bigfoot.com> wrote:
> I'm using the code below to read a pdf document, and it has no line feeds
> or carriage returns in the imported text. I'm therefore trying to just
> replace the symbol that looks like it would be an end of line (found by
> examining the characters in the "for loop") unichr(167).
> Unfortunately, the replace isn't working, does anyone know what I'm
> doing wrong? I tried a number of things so I left comments in place as a
> subset of the bunch of things I tried to no avail.

This is the first time I've ever looked inside a PDF file, and *only*
one file, but:

import pyPdf, sys
filename = sys.argv[1]  # path to the PDF, given on the command line
doc = pyPdf.PdfFileReader(open(filename, "rb"))
for pageno in range(doc.getNumPages()):
    page = doc.getPage(pageno)
    textu = page.extractText()
    print "pageno", pageno
    print type(textu)
    print repr(textu)  # repr shows non-ASCII characters as escapes

gives me <type 'unicode'> and text with lots of \n at places where
you'd expect them.
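
For the replace itself, here is a minimal sketch. The sample string is made up
(it is not taken from the FAA document), and the u'' literals assume Python 2
or Python 3.3+. The point it shows: replace() returns a new string rather than
changing textu in place, so the result has to be assigned to a name. Whether
that is what went wrong in the original code is only a guess.

```python
# Hypothetical sample standing in for the output of page.extractText().
textu = u"Sec. 1\xa7Sec. 2\xa7Sec. 3"

# Strings are immutable: replace() returns a new string and leaves
# textu untouched, so the result must be bound to a name.
fixed = textu.replace(u"\xa7", u"\n")

print(repr(textu))  # still contains \xa7
print(repr(fixed))  # \xa7 replaced by \n
```

Note that the pattern passed to replace() is a unicode literal (u"\xa7"), so it
has the same type as the text being searched.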

The only problem I can see is that where I see (and expect) quotation
marks (U+201C and U+201D) when viewing the file with Acrobat Reader,
the repr is showing \ufb01 and \ufb02. Similar problems with em-dashes
and apostrophes. I had a bit of a poke around:

1. repr(result of FlateDecode) includes *both* the raw bytes \x93 and
\x94, *and* the octal escapes \\223 and \\224 (which pyPdf translates
into \x93 and \x94).

2. Then pyPdf appears to push these through a fixed transformation
table (_pdfDocEncoding in generic.py) and they become \ufb01 and
\ufb02.

3. However:
|>>> '\x93\x94'.decode('cp1252') # as suspected
|u'\u201c\u201d' # as expected
|>>>
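
As a stopgap, the two code points from step 2 could be mapped back to the
quotation marks from step 3 after extraction. This is only a sketch: the
function name and the sample text are made up, it covers just the two
characters named above, and it would also convert any genuine fi/fl ligatures
in the text.

```python
# Hypothetical workaround: map the code points seen in the repr back to
# the quotation marks that cp1252 gives for the bytes 0x93 and 0x94.
LIGATURE_TO_QUOTE = {
    0xFB01: u"\u201c",  # U+FB01 -> left double quotation mark
    0xFB02: u"\u201d",  # U+FB02 -> right double quotation mark
}

def fix_quotes(text):
    """Return text with U+FB01/U+FB02 replaced by U+201C/U+201D."""
    # translate() on a unicode string takes a dict keyed by code point.
    return text.translate(LIGATURE_TO_QUOTE)

sample = u"\ufb01Special Use Airspace\ufb02"  # made-up sample text
print(repr(fix_quotes(sample)))
```

Calling fix_quotes(textu) on each page's text would apply the same mapping to
the extracted pages.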

AFAICT there is only one reference to encoding in the pyPdf docs: "if
pyPdf was unable to decode the string's text encoding" ...

Cheers,
John