On Thu, Oct 15, 2009 at 4:43 PM, Stef Mientki <span dir="ltr"><<a href="mailto:stef.mientki@gmail.com">stef.mientki@gmail.com</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
hello,<br>
<br>
By writing the following unicode string (I hope it can be sent on this mailing list)<br>
<br>
Bücken<br>
<br>
to a file<br>
<br>
fh.write ( line )<br>
<br>
I get the following error:<br>
<br>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 9: ordinal not in range(128)<br>
<br>
How should I write such a string to a file ?<br>
<br></blockquote><div><br></div><div>First, you have to understand that a file never really contains Unicode, at least not in the way it exists in memory in Python when you type line = u'Bücken'. A file contains a series of bytes that are an encoded form of that abstract Unicode data.</div>
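<div><br></div><div>To make that distinction concrete, here is a minimal sketch. The variable names are mine, and the string is spelled with an escape so the snippet does not depend on how the source file itself is encoded. It shows the same abstract string turning into different bytes under different encodings, and the ASCII codec refusing it in the same way as the error you quoted:</div>

```python
# Illustrative sketch; the variable names are hypothetical.
line = u'B\xfccken'  # the abstract unicode string from the question

utf8_bytes = line.encode('utf-8')
utf16_bytes = line.encode('utf-16-le')

# Six characters, but the number of bytes depends on the encoding.
char_count = len(line)
utf8_count = len(utf8_bytes)
utf16_count = len(utf16_bytes)

# ASCII has no byte for u'\xfc', so encoding to ASCII fails.
try:
    line.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

<div><br></div><div>The failure at the end is what happens implicitly when a unicode string is handed to a plain file object without saying which encoding to use.</div>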
<div><br></div><div>There are various encodings you can use; UTF-8 and UTF-16 are in my experience the most common. UTF-8 is an ASCII superset, and it's the one I see most often.</div><div><br></div><div>So, you can do:</div>
<div><br></div><div> import codecs</div><div> f = codecs.open('filepath', 'w', 'utf-8')</div><div> f.write(line)</div><div> f.close()</div><div><br></div><div>To read such a file, you'd use codecs.open as well, just with an 'r' mode instead of a 'w' mode.</div>
<div><br></div><div>Now, that uses a file object created with the "codecs" module, which works with Unicode strings rather than raw bytes. It automatically takes any Unicode strings you pass in, encodes them in the specified encoding (UTF-8 here), and writes the resulting bytes out.</div>
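<div><br></div><div>As a rough sketch of the full round trip with codecs.open (the temporary path and variable names are placeholders I made up, not anything from your program):</div>

```python
import codecs
import os
import tempfile

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
line = u'B\xfccken'

# Write: the codecs file object encodes unicode to UTF-8 bytes for us.
with codecs.open(path, 'w', 'utf-8') as f:
    f.write(line)

# Read: the same call with 'r' decodes the bytes back to unicode.
with codecs.open(path, 'r', 'utf-8') as f:
    read_back = f.read()

# What actually landed on disk is bytes, not unicode.
with open(path, 'rb') as raw:
    on_disk = raw.read()
```

<div><br></div><div>Here read_back should compare equal to line, while on_disk holds the encoded bytes, one byte longer than the string has characters because the u-umlaut takes two bytes in UTF-8.</div>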
<div><br></div><div>You can also do that manually with a regular file object, via:</div><div><br></div><div> f.write(line.encode("utf8"))</div><div><br></div><div>If you are reading such a file later with a normal file object (i.e., not one created with codecs.open), you would do:</div>
<div><br></div><div> f = open('filepath', 'rb')</div><div> byte_data = f.read()</div><div> uni_data = byte_data.decode("utf8")</div><div><br></div><div>That will convert the byte-encoded data back to a real Unicode string. If the file contains encoded Unicode data (something you can only know from the documentation of whatever produced that file), be sure to do this even when it doesn't seem necessary. For example, a UTF-8 encoded file might look and work like a completely normal ASCII file, but if it's really UTF-8, your code will eventually break the one time someone puts in a non-ASCII character. Since UTF-8 is an ASCII superset, it's indistinguishable from ASCII until it contains a non-ASCII character.</div>
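<div><br></div><div>A small sketch of that pitfall, using the manual encode/decode approach with a plain binary file object. The file name and the two sample lines are invented for illustration:</div>

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'manual.txt')

# Encode by hand and write through a plain binary file object.
with open(path, 'wb') as f:
    f.write(u'plain text\n'.encode('utf8'))
    f.write(u'B\xfccken\n'.encode('utf8'))

with open(path, 'rb') as f:
    byte_data = f.read()

first, second = byte_data.splitlines()

# The pure-ASCII line decodes to the same text under either codec.
ascii_ok = first.decode('ascii') == first.decode('utf8')

# The line with a non-ASCII character only decodes as UTF-8.
try:
    second.decode('ascii')
    ascii_broke = False
except UnicodeDecodeError:
    ascii_broke = True

uni_data = byte_data.decode('utf8')
```

<div><br></div><div>Code that treated this file as ASCII would work on the first line and fail on the second, which is why decoding as UTF-8 from the start is the safer habit.</div>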
<div><br></div><div>HTH,</div><div><br></div><div>--S</div></div>