Non-unicode strings & Python.

Tue Aug 31 01:48:57 EDT 2004

Jonathon Blake wrote:
> What happens when strings are read in from text files that were
> created using  GB 2312-1980, or KPS 9566-2003, or other, equally
> obscure code ranges?

Python has two kinds of strings: byte strings, and Unicode strings.
If you read data from a file, you get byte strings - i.e. a sequence
of bytes representing literally the encoded contents of the file.
If you want Unicode strings, you need to decode those bytes, for
example by opening the file with codecs.open and naming its encoding.
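
As a rough illustration of the difference, here is a minimal sketch. It
assumes Python 3 (where byte strings are the bytes type) and uses a
throwaway temporary file and two sample characters that are purely
illustrative, not part of the original question:

```python
import codecs
import os
import tempfile

text = u"\u4e2d\u6587"  # two Chinese characters, used only as sample data

# Create a small GB2312-encoded file to read back (illustrative sample file).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(text.encode("gb2312"))

# Reading in binary mode yields the raw encoded bytes of the file.
with open(path, "rb") as f:
    raw = f.read()

# codecs.open decodes while reading, so read() returns a Unicode string.
with codecs.open(path, "r", encoding="gb2312") as f:
    decoded = f.read()

os.remove(path)
```

Here raw holds the encoded bytes exactly as stored on disk, while
decoded holds the characters they represent; raw.decode("gb2312")
gives the same result as the codecs.open route.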

> The idea is to read text in the file format, and replace it with the
> appropriate Unicode character,then write it out as a new text file. 
> [Trivial to program, but incredibly time consuming to actually code]

Not at all:

import codecs
# read() returns the decoded text; codecs.open alone returns a file object
data = codecs.open(filename, "r", encoding="gb2312").read()
codecs.open(newfile, "w", encoding="utf-8").write(data)

assuming that by "appropriate Unicode character" you actually mean
"I want to write the file encoded as UTF-8".

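For larger files, the same idea can be wrapped in a small helper that
converts line by line instead of reading everything at once. This is
only a sketch: the function name convert_encoding and its parameters
are made up for illustration, and it assumes Python ships a codec for
the source encoding (as it does for gb2312).

```python
import codecs


def convert_encoding(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Re-encode a text file, decoding and encoding one line at a time."""
    with codecs.open(src_path, "r", encoding=src_encoding) as src:
        with codecs.open(dst_path, "w", encoding=dst_encoding) as dst:
            for line in src:
                # line is a Unicode string; dst encodes it on write
                dst.write(line)
```

A call such as convert_encoding(filename, newfile, "gb2312") would then
do the conversion. Bytes that are not valid in the source encoding
raise UnicodeDecodeError rather than being silently dropped.
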
Regards,
Martin