[Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?

Sun Nov 20 22:45:42 CET 2011

dave selby wrote:

> I split the HTML and print text and I get loads of
> 
> \x00T\x00r\x00i\x00a\x00  ie I get \x00 breaking up every character.
> 
> Any idea what is happening and how to get back to a list of ascii strings ?

How did you generate the HTML file? What other applications have you 
used to save the document?

Something in the tool chain before it reached Python has saved it using 
a wide (two byte) encoding, most likely UTF-16 as that is widely used 
by Windows and Java. In UTF-16 every ASCII character is stored as two 
bytes, one of which is \x00, which is why you see \x00 between the 
characters. With the right settings, it could take as little as opening 
the file in Notepad, then clicking Save.
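
To see where the \x00 bytes come from, here is a small sketch. The 
string "Tria" is just borrowed from your output, and I don't know which 
byte order your file actually uses, so both are shown:

```python
# Encode a short ASCII string as UTF-16 and look at the raw bytes.
text = u"Tria"

little_endian = text.encode("utf-16-le")  # low byte first, no byte order mark
big_endian = text.encode("utf-16-be")     # high byte first, no byte order mark

# Either way, every other byte is \x00 for plain ASCII characters.
print(repr(little_endian))
print(repr(big_endian))

# Decoding with the matching codec gives the original text back.
roundtrip = little_endian.decode("utf-16-le")
print(repr(roundtrip))
```

Either way, each character takes two bytes, and for ASCII text one of 
them is always zero.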

If this isn't making sense to you, you should read this:

http://www.joelonsoftware.com/articles/Unicode.html

If my guess is right that the file is UTF-16, then you can "fix" it by 
doing this:

# Untested.
f = open("my_html_file.html", "rb")  # binary mode: read the raw bytes
text = f.read().decode("utf-16")  # convert bytes to text
f.close()
data = text.encode("ascii")  # If this fails, try "latin-1" instead
f = open("my_html_file2.html", "wb")  # write bytes back to disk
f.write(data)
f.close()

Once you've inspected the rewritten file my_html_file2.html and are 
satisfied with it, you can delete the original one.
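
If you need to do this for more than one file, the same conversion can 
be wrapped in a function. This is only a sketch: reencode_file and its 
parameter names are made up for illustration, and it still assumes my 
guess about UTF-16 is right.

```python
import io


def reencode_file(src_path, dst_path,
                  src_encoding="utf-16", dst_encoding="ascii"):
    """Read src_path as src_encoding and write it to dst_path as dst_encoding.

    Hypothetical helper, not part of any library. Raises
    UnicodeEncodeError if the text contains a character that
    dst_encoding cannot represent.
    """
    # newline="" turns off newline translation so line endings are kept as-is.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text
```

You would call it as reencode_file("my_html_file.html", 
"my_html_file2.html"), and pass dst_encoding="latin-1" if the ASCII 
conversion fails.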

-- 
Steven