unicode .replace not working - why?
nospampetersk at bigfoot.com
Sun Oct 19 00:47:02 CEST 2008
The "distraction" was my problem. I replaced the textu.replace as you
suggested and it works fine.
On Sun, 12 Oct 2008 19:53:09 -0700, Mark Tolonen wrote:
> In your original code the result of textu.replace was never assigned
> back. It should be,
> as Dennis suggested (but maybe you were distracted by his 'fn'
> replacement, so I'll leave it out):
> textu = textu.replace(unichr(167),'\n')
> .replace does not modify the string in place. It returns the modified
> string, so you have to reassign it.
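
To illustrate the point about reassignment, here is a minimal sketch. The sample string is made up for illustration and is not taken from the PDF in question; u"\u00a7" is the same character as unichr(167). The `__future__` import is only there so the snippet runs on both Python 2 and Python 3.

```python
from __future__ import print_function

SECTION_SIGN = u"\u00a7"  # same code point as unichr(167) in Python 2

# Made-up sample text standing in for one extracted page.
textu = u"first entry" + SECTION_SIGN + u"second entry"

# The return value is thrown away here, so textu is left untouched.
textu.replace(SECTION_SIGN, u"\n")
unchanged = textu

# Rebinding the name to the returned string keeps the result.
textu = textu.replace(SECTION_SIGN, u"\n")

print(repr(unchanged))  # still contains the section sign
print(repr(textu))      # section sign replaced by a newline
```

Strings are immutable, so every string method that "changes" text works this way: it hands back a new string and leaves the original alone.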
> "Kurt Peters" <nospampeterskurt at msn.com> wrote in message
> news:-OmdnXghhrxMN2_VnZ2dnUVZ_rHinZ2d at comcast.com...
>> clearly though, my "For loop" shows a character using ord(167), and
>> print repr(textu), it shows the character \xa7 (as does Peter Otten's
>> post). So you can see what I see, here's the document I'm using - the
>> Special Use Airspace document at
>> http://www.faa.gov/airports_airtraffic/air_traffic/publications/ which
>> is = JO 7400.8P (PDF)
>> if you just look at page three, it shows those unusual characters. Once
>> again, using a "simple" replace, doesn't seem to work. I can't seem to
>> figure out how to get it to work, despite all the great posts
>> attempting to shed some light on the subject.
>> "John Machin" <sjmachin at lexicon.net> wrote in message
>> e49a-49a3-8a2c-1adbcbb81d88 at u40g2000pru.googlegroups.com...
>> On Oct 12, 7:05 am, Kurt Peters <nospampete... at bigfoot.com> wrote:
>>> I'm using the code below to read a pdf document, and it has no line
>>> feeds or carriage returns in the imported text. I'm therefore trying
>>> to just replace the symbol that looks like it would be an end of line
>>> (found by examining the characters in the "for loop") unichr(167).
>>> Unfortunately, the replace isn't working, does anyone know what I'm
>>> doing wrong? I tried a number of things so I left comments in place as
>>> a subset of the bunch of things I tried to no avail.
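
The original loop is not shown in this message, so the following is only a sketch of the kind of inspection described: list every non-ASCII character in a unicode string together with its code point. The helper name non_ascii_counts and the sample text are hypothetical.

```python
from __future__ import print_function
from collections import Counter


def non_ascii_counts(text):
    """Count each character in text whose code point is above 127."""
    return Counter(ch for ch in text if ord(ch) > 127)


# Made-up sample; real input would be the result of page.extractText().
sample = u"R-2101\u00a7Boundaries\u00a7Altitudes"

counts = non_ascii_counts(sample)
for ch, n in sorted(counts.items()):
    print("U+%04X (decimal %d) occurs %d time(s)" % (ord(ch), ord(ch), n))
```

A character reported as decimal 167 is U+00A7, which repr shows as \xa7.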
>> This is the first time I've ever looked inside a PDF file, and *only*
>> one file, but:
>> import pyPdf, sys
>> filename = sys.argv[1]  # path to the PDF, given on the command line
>> doc = pyPdf.PdfFileReader(open(filename, "rb"))
>> for pageno in xrange(doc.getNumPages()):
>>     page = doc.getPage(pageno)
>>     textu = page.extractText()
>>     print "pageno", pageno
>>     print type(textu)
>>     print repr(textu)
>> gives me <type 'unicode'> and text with lots of \n at places where
>> you'd expect them.
>> The only problem I can see is that where I see (and expect) quotation
>> marks (U+201C and U+201D) when viewing the file with Acrobat Reader,
>> the repr is showing \ufb01 and \ufb02. Similar problems with em-dashes
>> and apostrophes. I had a bit of a poke around:
>> 1. repr(result of FlateDecode) includes *both* the raw bytes \x93 and
>> \x94, *and* the octal escapes \\223 and \\224 (which pyPdf translates
>> into \x93 and \x94).
>> 2. Then pyPdf appears to push these through a fixed transformation
>> table (_pdfDocEncoding in generic.py) and they become \ufb01 and
>> \ufb02.
>> 3. However:
>> |>>> '\x93\x94'.decode('cp1252') # as suspected
>> |u'\u201c\u201d' # as
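
The difference between the two readings of the bytes \x93 and \x94 can be checked with the standard library alone. This sketch does not call pyPdf; it decodes the bytes as cp1252 and looks up the Unicode names of both that result and the \ufb01/\ufb02 pair reported above. The variable names are made up for illustration.

```python
from __future__ import print_function
import unicodedata

raw = b"\x93\x94"                 # the two bytes seen after FlateDecode
as_cp1252 = raw.decode("cp1252")  # interpretation as Windows-1252
as_reported = u"\ufb01\ufb02"     # what the repr of the pyPdf text showed

for label, text in (("cp1252", as_cp1252), ("reported", as_reported)):
    for ch in text:
        print("%-8s U+%04X %s" % (label, ord(ch), unicodedata.name(ch)))
```

The cp1252 reading gives the left and right double quotation marks that Acrobat Reader displays, while U+FB01 and U+FB02 are the "fi" and "fl" ligatures.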
>> AFAICT there is only one reference to encoding in the pyPdf docs: "if
>> pyPdf was unable to decode the string's text encoding" ...