Puzzled by code pages
Adam Tauno Williams
awilliam at whitemice.org
Sat May 15 10:12:40 EDT 2010
On Sat, 2010-05-15 at 20:30 +1000, Lie Ryan wrote:
> On 05/15/10 10:27, Adam Tauno Williams wrote:
> > I'm trying to process OpenStep plist files in Python. I have a parser
> > which works, but only for strict ASCII. However plist files may contain
> > accented characters - equivalent to ISO-8859-2 (I believe). For example
> > I read in the line:
> >>>> handle = open('file.txt', 'rb')
> >>>> data = handle.read()
> >>>> handle.close()
> >>>> data
> > ' "skyp4_filelist_10201/localit\xc3\xa0 termali_sortfield" =
> > NSFileName;\n'
> I presume you're using Python 2.x.
Yes. But the days of all-unicode strings will be wonderful when they
come. :)
> > What is the correct way to re-encode this data into UTF-8 so I can use
> > unicode strings, and then write the output back to ISO8859-?
> > I can read the file using codecs as ISO8859-2, but it still doesn't seem
> > correct.
> >>>> handle = codecs.open('file.txt', 'rb', encoding='iso8859-2')
> >>>> data = handle.read()
> >>>> handle.close()
> >>>> data
> > u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
> > NSFileName;\n'
> When printing in the interactive interpreter, python uses __repr__
> representation by default. If you want to use __str__ representation use
> "print data" (note, your terminal must support printing unicode
> characters);
Using GNOME Terminal, so Unicode characters should display correctly.
And I do see the characters when I 'cat' the file.
> either way, even though the string looks like '\u0102' when
> printed on the terminal, the binary pattern in memory should
> correctly represent the accented character.
Yep. But in the interpreter both unicode() and repr() produce the same
output. Nothing displays the accented character.
>>> import codecs
>>> h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
>>> data = h.read()
>>> h.close()
>>> str(data)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 33-34: ordinal not in range(128)
>>> unicode(data)
u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" = NSFileName;\n'
>>> repr(data)
'u\' "skyp4_filelist_10201/localit\\u0102\\xa0 termali_sortfield" = NSFileName;\\n\''
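
One way to check which encoding the file really uses is to decode the raw bytes both ways and compare. A minimal sketch, assuming the bytes in the file are exactly those shown by the 'rb' read earlier in this thread (the variable names are illustrative only):

```python
# Raw bytes copied from the binary read shown earlier (assumed exact).
raw = b'localit\xc3\xa0 termali'

# ISO-8859-2 maps every byte to one character:
# 0xC3 -> U+0102 and 0xA0 -> U+00A0, matching the odd output above.
as_latin2 = raw.decode('iso8859-2')

# UTF-8 treats 0xC3 0xA0 as a single two-byte sequence: U+00E0.
as_utf8 = raw.decode('utf-8')

print(repr(as_latin2))
print(repr(as_utf8))
```

If the UTF-8 result contains the single character U+00E0 (a with grave) where the ISO-8859-2 result shows two characters, the file is most likely UTF-8 already rather than ISO-8859-2.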
I think I'm getting close. Parsing the file seems to work, and while
writing it out raises no error, rereading my own output fails. :(
Possibly I'm accidentally writing the output as UTF-8 and not
ISO8859-2. I need the internal data to be UTF-8, but read from ISO8859-2
and written back to ISO8859-2 [at least that is what I believe from
the OpenStep files I'm seeing].
What is the 'official' way to encode something from UTF-8 to another
code page? I *assumed* that if I wrote a unicode stream back through:
h = codecs.open(output_filename, 'wb', encoding='iso8859-2')
data = writer.store(defaults)
h.write(data)
h.close()
that it would be re-encoded [word?]. But maybe not?
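
A round trip of this kind can be sketched as follows, assuming Python 2.6+ or 3 (for `io.open`) and a temporary file standing in for the real plist. The sample text and the file name are made up for illustration. Reading the output back in binary mode shows which encoding actually landed on disk:

```python
import io
import os
import tempfile

# Illustrative text; U+0142 (l with stroke) exists in ISO-8859-2.
text = u'"key_\u0142" = NSFileName;\n'

# Hypothetical output file in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), 'out.plist')

# Text in, bytes out: the file object encodes on write.
with io.open(path, 'w', encoding='iso8859-2', newline='') as h:
    h.write(text)

# Binary read: no decoding, so this is what is really on disk.
with io.open(path, 'rb') as h:
    on_disk = h.read()

# Decoding with the same codec should give back the original text.
round_trip = on_disk.decode('iso8859-2')

print(repr(on_disk))
print(round_trip == text)
```

The same pattern should apply to `codecs.open`. The key point is that `h.write()` should receive a unicode object, not a byte string that has already been encoded.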
> f = codecs.open("in.txt", 'rb', encoding="iso8859-2")
> f2 = codecs.open("out.txt", 'wb', encoding="utf-8")
> s = f.read()
> f2.write(s)
> f.close()
> f2.close()
--
Adam Tauno Williams <awilliam at whitemice.org> LPIC-1, Novell CLA
<http://www.whitemiceconsulting.com>
OpenGroupware, Cyrus IMAPd, Postfix, OpenLDAP, Samba
-------------- next part --------------
"skyp4_filelist_10201/località termali_sortfield" = NSFileName;