Non-unicode strings & Python.
"Martin v. Löwis"
martin at v.loewis.de
Tue Aug 31 01:48:57 EDT 2004
Jonathon Blake wrote:
> What happens when strings are read in from text files that were
> created using GB 2312-1980, or KPS 9566-2003, or other, equally
> obscure code ranges?
Python has two kinds of strings: byte strings, and Unicode strings.
If you read data from a file, you get byte strings - i.e. a sequence
of bytes representing literally the encoded contents of the file.
If you want Unicode strings, you need to use codecs.open.
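As a minimal sketch of the difference (the sample text and the temporary file are made up for illustration, and the file is opened in binary mode so the byte-level view is explicit on any Python version):

```python
import codecs
import os
import tempfile

# Hypothetical sample data: two Chinese characters, stored as GB2312.
text = u"\u4e2d\u6587"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(text.encode("gb2312"))

# Plain binary read: the raw encoded bytes, exactly as stored on disk.
with open(path, "rb") as f:
    raw = f.read()

# codecs.open decodes while reading, so read() returns a Unicode string.
with codecs.open(path, "r", encoding="gb2312") as f:
    decoded = f.read()

print(repr(raw))      # four bytes, two per character
print(repr(decoded))  # two Unicode characters
```

The byte string has length 4 and the Unicode string has length 2, which is the whole point: the bytes only become characters once you say which encoding they are in.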
> The idea is to read text in the file format, and replace it with the
> appropriate Unicode character, then write it out as a new text file.
> [Trivial to program, but incredibly time consuming to actually code]
Not at all:
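import codecs
# read() returns the decoded Unicode text; codecs.open alone returns a file object
data = codecs.open(filename, "r", encoding="gb2312").read()
codecs.open(newfile, "w", encoding="utf-8").write(data)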
assuming that by "appropriate Unicode character" you actually mean
"I want to write the file encoded as UTF-8".
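For large files, the same conversion can be done in chunks, with both files closed explicitly. This is only a sketch: the helper name transcode, its parameters, the chunk size, and the throwaway file names are my own choices for illustration, not anything the codecs module provides.

```python
import codecs
import os
import tempfile


def transcode(src_path, dst_path, src_encoding, dst_encoding="utf-8",
              chunk_size=64 * 1024):
    """Hypothetical helper: re-encode src_path into dst_path chunk by chunk."""
    with codecs.open(src_path, "r", encoding=src_encoding) as src:
        with codecs.open(dst_path, "w", encoding=dst_encoding) as dst:
            while True:
                chunk = src.read(chunk_size)  # decoded Unicode text
                if not chunk:                 # empty string means end of file
                    break
                dst.write(chunk)              # encoded again on the way out


# Usage with throwaway files (names and contents are made up for the demo).
workdir = tempfile.mkdtemp()
old_path = os.path.join(workdir, "old.txt")
new_path = os.path.join(workdir, "new.txt")
with open(old_path, "wb") as f:
    f.write(u"\u4e2d\u6587\n".encode("gb2312"))

transcode(old_path, new_path, "gb2312")

with open(new_path, "rb") as f:
    converted = f.read()
print(repr(converted))
```

Any encoding name that Python ships a codec for can be passed as src_encoding; an encoding without a codec would need one written first.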
Regards,
Martin