Converting text file to different encoding.
davea at davea.name
Fri Apr 17 16:48:57 CEST 2015
On 04/17/2015 09:19 AM, subhabrata.banerji at gmail.com wrote:
> I am having few files in default encoding. I wanted to change their encodings,
> preferably in "UTF-8", or may be from one encoding to any other encoding.
You neglected to specify what Python version this is for. Other
information that'd be useful is whether the file size is small enough
that two copies of it will all fit reasonably into memory.
I'll assume it's version 2.7, because of various clues in your sample
code. But if it's version 3.x, it could be substantially easier.
> I was trying it as follows,
> >>> import codecs
> >>> sourceEncoding = "iso-8859-1"
> >>> targetEncoding = "utf-8"
> >>> source = open("source1","w")
Mode "w" will truncate the source1 file, leaving you nothing to process.
I'd suggest "r".
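If you want to see the truncation for yourself, here is a small sketch that is safe to run because it only touches a temporary file. The helper name size_after_open is made up for this illustration.

```python
import os
import tempfile

def size_after_open(mode):
    """Create a small file, reopen it with `mode`, close it, return its size."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Put some content in the file first.
        with open(path, "w") as f:
            f.write("some text\n")
        # Merely opening with "w" empties the file; "r" leaves it alone.
        handle = open(path, mode)
        handle.close()
        return os.path.getsize(path)
    finally:
        os.remove(path)

print(size_after_open("r"))  # non-zero: the content is still there
print(size_after_open("w"))  # zero: the content is gone
```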
> >>> target = open("target", "w")
It's not usually a good idea to use the same variable for both the file
name and the opened file object. What if you need later to print the
name, as in an error message?
> >>> target.write(unicode(source, sourceEncoding).encode(targetEncoding))
I'd not recommend trying to do so much in one line, at least until you
understand all the pieces. Programming is not (usually) a contest to
write the most obscure code, but rather to make a program you can still
read and understand six months from now. And, oh yeah, something that
will run and accomplish something.
> but it was giving me error as follows,
> Traceback (most recent call last):
> File "<pyshell#6>", line 1, in <module>
> target.write(unicode(source, sourceEncoding).encode(targetEncoding))
> TypeError: coercing to Unicode: need string or buffer, file found
If you factor this you will discover your error. Nowhere do you read
the source file into a byte string. And that's what is needed for the
unicode constructor. Factored, you might have something like:
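    encodedtext = source.read()                    # raw bytes from the file
    text = unicode(encodedtext, sourceEncoding)    # decode the bytes, not the file object
    reencodedtext = text.encode(targetEncoding)    # encode to the new encoding
    target.write(reencodedtext)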
Next, you need to close the files.
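Putting the factored steps together with the closes, a minimal sketch might look like the following. Some assumptions of mine: the function name convert_whole_file and its parameter names are invented, the files are opened in binary mode ("rb"/"wb") so no newline translation gets in the way, and the bytes method .decode() stands in for the unicode() constructor so that the same code runs on both 2.7 and 3.x. It still reads the whole file into memory.

```python
def convert_whole_file(source_name, target_name,
                       source_encoding="iso-8859-1",
                       target_encoding="utf-8"):
    """Re-encode a whole file in memory, one explicit step at a time."""
    # Step 1: read the raw bytes from the source file.
    source = open(source_name, "rb")
    try:
        encoded_text = source.read()
    finally:
        source.close()

    # Step 2: decode the bytes into text using the source encoding.
    text = encoded_text.decode(source_encoding)

    # Step 3: encode the text into bytes using the target encoding.
    reencoded_text = text.encode(target_encoding)

    # Step 4: write the result and close the target file.
    target = open(target_name, "wb")
    try:
        target.write(reencoded_text)
    finally:
        target.close()
```

Keeping the file names in their own variables, separate from the file objects, also leaves them available for error messages.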
There are a number of ways to improve that code, but this is a start.
- Use codecs.open() to open the files, so encoding is handled
  implicitly in the file objects.
- Use the with... syntax so that the file closes are implicit.
- Read and write the files in a loop, a line at a time, so that you
  needn't have all the data in memory (at least twice) at one time. This
  will also help enormously if you encounter any errors, and want to
  report the location and problem to the user. It might even turn out to
- You should write non-trivial code in a text file, and run it from
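
The first three suggestions in the list above can be sketched roughly as follows. This is only one way to do it: the function name convert_by_line, its parameters, and the returned line count are my own additions for illustration. Note that newer 3.x releases mark codecs.open() as deprecated in favour of the built-in open() with an encoding argument, but the call itself works on 2.7 and 3.x.

```python
import codecs

def convert_by_line(source_name, target_name,
                    source_encoding="iso-8859-1",
                    target_encoding="utf-8"):
    """Re-encode source_name into target_name one line at a time.

    Returns the number of lines copied, which can be handy in messages.
    """
    line_count = 0
    # codecs.open() decodes on read and encodes on write, so the loop
    # body only ever handles text. The with blocks close both files.
    with codecs.open(source_name, "r", encoding=source_encoding) as source:
        with codecs.open(target_name, "w", encoding=target_encoding) as target:
            for line in source:
                target.write(line)
                line_count += 1
    return line_count
```

Saved in a text file of its own, this could be called as convert_by_line("source1", "target").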