Convert to big5 to unicode

Richard Schulman raschulmanxx at verizon.net
Thu Sep 7 19:11:50 CEST 2006


On 7 Sep 2006 01:27:55 -0700, "GM" <garymok33 at gmail.com> wrote:

>Could you all give me some guide on how to convert my big5 string to
>unicode using python? I already knew that I might use cjkcodecs or
>python 2.4 but I still don't have idea on what exactly I should do.
>Please give me some sample code if you could. Thanks a lot

Gary, I used this Java program quite a few years ago to convert
various Big5 files to UTF-16. (Sorry it's Java not Python, but I'm a
very recent convert to the latter.) My newsgroup reader has messed the
formatting up somewhat. If this causes a problem, email me and I'll
send you the source directly.

-Richard Schulman

/*
 * This program converts an input file of one encoding format to an
 * output file of another format. It will be mainly used to convert
 * Big5 text files to Unicode text files.
 */

import java.io.*;

public class ConvertEncoding
{
	// Usage: java ConvertEncoding <input file> <output file>
	//
	// To convert between other encodings, change the call in main, e.g.
	//	convert(args[0], args[1], "GB2312", "UTF8");
	// or numerous variations thereon. Among possible choices for input
	// or output: "GB2312", "BIG5", "UTF8", "UTF-16LE". The last named
	// is MS UCS-2 format.
	// Argument order: input file, output file, input encoding,
	// output encoding.
	public static void main(String[] args)
	{
		if (args.length < 2)
		{
			System.err.println("Usage: java ConvertEncoding <input file> <output file>");
			System.exit(1);
		}
		try
		{
			convert(args[0], args[1], "BIG5", "UTF-16LE");
		}
		catch (Exception e)
		{
			System.err.println(e.getMessage());
			System.exit(1);
		}
	}

	public static void convert(String infile, String outfile, String from, String to)
		throws IOException, UnsupportedEncodingException
	{
		// Set up byte streams; a null file name means stdin or stdout
		InputStream in;
		if (infile != null)
			in = new FileInputStream(infile);
		else
			in = System.in;

		OutputStream out;
		if (outfile != null)
			out = new FileOutputStream(outfile);
		else
			out = System.out;

		// Set up character streams
		Reader r = new BufferedReader(new InputStreamReader(in, from));
		Writer w = new BufferedWriter(new OutputStreamWriter(out, to));

		// Write U+FEFF (byte order mark) first. This character signals
		// Unicode in the NT environment.
		w.write("\ufeff");
		char[] buffer = new char[4096];
		int len;
		while ((len = r.read(buffer)) != -1)
			w.write(buffer, 0, len);
		r.close();
		w.flush();
		w.close();
	}
}
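
Since the question asked for Python: here is a rough sketch of the same conversion, assuming Python 3, whose standard library already ships a "big5" codec (Python 2.4 and later include the CJK codecs, so cjkcodecs is not needed separately). The names big5_to_text and convert are made up for illustration. As in the Java program, the file version writes U+FEFF first and then UTF-16LE.

```python
def big5_to_text(data: bytes) -> str:
    """Decode a Big5 byte string into a Python (Unicode) string."""
    return data.decode("big5")


def convert(infile, outfile, from_enc="big5", to_enc="utf-16-le"):
    """Copy infile to outfile, re-encoding from from_enc to to_enc.

    The defaults mirror the Java program: Big5 in, UTF-16LE out,
    with a leading U+FEFF byte order mark.
    """
    # newline="" turns off newline translation, so line endings are
    # copied through unchanged.
    with open(infile, "r", encoding=from_enc, newline="") as src, \
         open(outfile, "w", encoding=to_enc, newline="") as dst:
        dst.write("\ufeff")  # byte order mark, as in the Java version
        while True:
            chunk = src.read(4096)  # up to 4096 characters at a time
            if not chunk:
                break
            dst.write(chunk)
```

For a string already in memory, big5_to_text(raw_bytes) is all that is needed. For files, a call such as convert("input.txt", "output.txt") does the work, where both file names are placeholders.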


