[Tutor] unicode to plain text conversion

Tue Apr 7 17:56:16 CEST 2009

Thanks all!

Kent, this syntax worked. I was able to figure out the encoding just by
trial and error. It is UTF-16. Now the only thing is that the
conversion is double-spacing the lines of data. I'm thinking this must
be something that I need to fix in my syntax. I will continue to try and
figure it out, but any pointing out of the obvious or other ideas would
be much appreciated. Again, newbie here.
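
One possible cause of the double spacing, offered as a guess since the input file is not shown: if the UTF-16 file uses Windows line endings, each decoded line still ends in '\r\n', and writing that to a text-mode file on Windows turns the '\n' into '\r\n' again, giving '\r\r\n'. A minimal sketch of a conversion that normalizes line endings follows. The function name, the default encodings, and the choice of '\n' as the output line ending are all assumptions for illustration, not something from this thread.

```python
import io


def convert_to_plain_text(src_path, dst_path,
                          src_encoding="utf-16", dst_encoding="utf-8"):
    """Stream src_path to dst_path line by line, normalizing line endings.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # Text mode with the default newline handling translates '\r\n' and
    # '\r' to '\n' while reading, so no stray '\r' reaches the output.
    with io.open(src_path, "r", encoding=src_encoding) as inp, \
         io.open(dst_path, "w", encoding=dst_encoding, newline="\n") as outp:
        # newline="\n" writes each '\n' unchanged instead of expanding it
        # to the platform default.
        for line in inp:  # one line at a time, so a >4GB file is fine
            outp.write(line)
```

Passing newline="\r\n" for the output instead would keep Windows-style line endings, if the program that reads the result expects them.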

Thanks
Matt

Matthew Pirritano, Ph.D.
Research Analyst IV
Medical Services Initiative (MSI)
Orange County Health Care Agency
(714) 568-5648
-----Original Message-----
From: tutor-bounces+mpirritano=ochca.com at python.org
[mailto:tutor-bounces+mpirritano=ochca.com at python.org] On Behalf Of Kent
Johnson
Sent: Monday, April 06, 2009 5:51 PM
To: Pirritano, Matthew
Cc: Python Tutor
Subject: Re: [Tutor] unicode to plain text conversion

On Mon, Apr 6, 2009 at 6:48 PM, Pirritano, Matthew
<MPirritano at ochca.com> wrote:
> Hello python people,
>
> I am a total newbie. I have a very large file > 4GB that I need to
> convert from Unicode to plain text. I used to just use dos when the file
> was < 4GB but it no longer seems to work. Can anyone point me to some
> python code that might perform this function?

What is the encoding of the Unicode file?

Assuming that the file has lines that will each fit in memory, you can
use the codecs module to decode the unicode. Something like this:

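import codecs

# Decode the input as UTF-16LE; iterating over inp yields one line at a time.
inp = codecs.open('Unicode_file.txt', 'r', 'utf-16le')
# Open the output for writing ('w'); the default mode is read-only.
outp = open('new_text_file.txt', 'w')
outp.writelines(inp)
inp.close()
outp.close()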

The above code assumes UTF-16LE encoding; change it to the correct one
if that is not right. A list of supported encodings is here:
http://docs.python.org/library/codecs.html#id3
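
If the encoding is unknown, one alternative to trial and error is to look at the first few bytes of the file for a byte order mark (BOM). The sketch below is only a heuristic under that assumption: it returns None for files without a BOM, and the function name and the table of candidate encodings are made up for illustration.

```python
import codecs

# Longer BOMs come first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
# The encoding names returned are ones that consume the BOM when decoding.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(path):
    """Return an encoding name based on the file's BOM, or None if absent.

    Hypothetical helper for illustration; a file with no BOM cannot be
    identified this way.
    """
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM checked here is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

The returned name can be passed as the encoding argument in place of 'utf-16le'. Note that 'utf-16' (unlike 'utf-16le') reads the BOM to pick the byte order and drops it from the decoded text.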

Kent