Unicode string handling problem
John Machin
sjmachin at lexicon.net
Tue Sep 5 21:34:59 EDT 2006
Richard Schulman wrote:
> The following program fragment works correctly with an ascii input
> file.
>
> But the file I actually want to process is Unicode (utf-16 encoding).
> The file must be Unicode rather than ASCII or Latin-1 because it
> contains mixed Chinese and English characters.
>
> When I run the program below I get an attribute_count of zero, which
> is incorrect for the input file, which should give a value of fifteen
> or sixteen. In other words, the count function isn't recognizing the
> ", characters in the line being read. Here's the program:
>
> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
> try:
>     # Skip the first line; make the second available for processing
>     in_file.readline()
>     in_line = readline()
You mean in_line = in_file.readline(), I hope. Do please copy/paste
actual code, not what you think you ran.
>     attribute_count = in_line.count('",')
>     print attribute_count
Insert
    print type(in_line)
    print repr(in_line)
here [also make the appropriate changes to get the same info from the
first line], run it again, copy/paste what you get, show us what you
see.
If you're coy about that, then you'll have to find out yourself whether
it has a BOM at the front and, if not, whether it's little- or big-endian.
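A minimal sketch of such a BOM check, assuming you read the first bytes of the file in binary mode. The helper name sniff_utf16_bom and the sample byte strings are made up for illustration. Note that a UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one, so this is only a hint, not proof.

```python
import codecs


def sniff_utf16_bom(raw):
    """Guess a codec name from a leading UTF-16 BOM; return None if absent.

    Hypothetical helper: it only looks at the first two bytes.
    """
    if raw.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return None


# In-memory samples standing in for open(path, "rb").read(2) on a real file.
sample_le = codecs.BOM_UTF16_LE + u'"a",'.encode("utf-16-le")
sample_be = codecs.BOM_UTF16_BE + u'"a",'.encode("utf-16-be")
sample_no_bom = u'"a",'.encode("utf-16-le")

for sample in (sample_le, sample_be, sample_no_bom):
    print(sniff_utf16_bom(sample))
```

If there is no BOM, repr() of the first line usually settles it: for mostly-ASCII text the zero bytes come after each character in little-endian data and before each character in big-endian data.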
> finally:
>     in_file.close()
>
> Any suggestions?
>
1. Read the Unicode HOWTO.
2. Read the docs on the codecs module ...
You'll need to use
    in_file = codecs.open(filepath, mode, encoding="utf16???????")
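A rough sketch of how that might look end to end, under the assumption that the file starts with a BOM so that the plain "utf-16" codec can work out the byte order itself. The temporary file and its contents are invented stand-ins for the real input, which I haven't seen. The raw-bytes count is there to show one likely reason for a count of zero: in UTF-16 the quote and the comma are each two bytes, so the two-byte pattern '",' never appears in the undecoded data.

```python
import codecs
import os
import tempfile

# Create a small UTF-16 file to read back (made-up content, not the real input).
fd, path = tempfile.mkstemp(suffix=".my")
os.close(fd)
out_file = codecs.open(path, "w", encoding="utf-16")  # this codec writes a BOM
try:
    out_file.write(u'header line\n"a","b","\u4e2d\u6587",\n')
finally:
    out_file.close()

# Undecoded bytes: the quote and comma are separated by a zero byte.
raw_file = open(path, "rb")
try:
    raw_count = raw_file.read().count(b'",')
finally:
    raw_file.close()

# Decoded text: codecs.open hands back unicode strings, so the count works.
in_file = codecs.open(path, "r", encoding="utf-16")
try:
    in_file.readline()            # skip the first line
    in_line = in_file.readline()  # second line, decoded
    attribute_count = in_line.count(u'",')
finally:
    in_file.close()

os.remove(path)
print(raw_count)
print(attribute_count)
```

If the file turns out to have no BOM, the encoding would have to name the byte order explicitly ("utf-16-le" or "utf-16-be").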
It would also be a good idea to get into the habit of using unicode
constants like u'",'
HTH,
John