[Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?
Steven D'Aprano
steve at pearwood.info
Sun Nov 20 22:45:42 CET 2011
dave selby wrote:
> I split the HTML and print text and I get loads of
>
> \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character.
>
> Any idea what is happening and how to get back to a list of ascii strings ?
How did you generate the HTML file? What other applications have you
used to save the document?
Something in the tool chain before it reached Python has saved it using
a wide encoding (two bytes per character for text like yours), most
likely UTF-16 as that is widely used by Windows and Java. With the right
settings, it could take as little as opening the file in Notepad, then
clicking Save.
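To see why that gives the output you describe, here is a small sketch. I have picked the big-endian flavour of UTF-16 only because it matches the \x00T\x00r pattern you quoted; that is an assumption, and your file may well use the other byte order.

```python
# Illustration only: "Tria" is taken from the output quoted above.
text = u"Tria"

# Each ASCII character becomes two bytes in UTF-16, one of them a NUL.
# In big-endian order the NUL comes first: 00 54 00 72 00 69 00 61.
wide = text.encode("utf-16-be")
print(repr(wide))

# Decoding with the matching codec recovers the original text.
restored = wide.decode("utf-16-be")
print(restored)
```

Printing the raw bytes without decoding them is what shows a \x00 next to every character.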
If this isn't making sense to you, you should read this:
http://www.joelonsoftware.com/articles/Unicode.html
If my guess is right that the file is UTF-16, then you can "fix" it by
doing this:
# Untested.
# Binary mode ("rb"/"wb") keeps Windows from translating line endings
# in the raw bytes.
f = open("my_html_file.html", "rb")
text = f.read().decode("utf-16")  # convert bytes to text
f.close()
data = text.encode("ascii")  # If this fails, try "latin-1" instead
f = open("my_html_file2.html", "wb")  # write bytes back to disk
f.write(data)
f.close()
Once you've inspected the re-written file my_html_file2.html and it is
okay to your satisfaction, you can delete the original one.
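One caveat with the "utf-16" codec: it looks for a byte order mark at the start of the data to tell which byte of each pair comes first, and without one it has to assume an order, which may be the wrong one. If the plain decode gives you gibberish, something like the following might help. It is only a sketch: guess_utf16_codec is a name I made up, and counting NUL bytes is a heuristic that only works for text that is mostly ASCII.

```python
import codecs

def guess_utf16_codec(raw):
    """Guess a codec name for UTF-16 bytes (hypothetical helper, heuristic)."""
    # A byte order mark settles the question; the "utf-16" codec consumes it.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # For mostly-ASCII text the NUL is the high byte of each pair:
    # it comes first in big-endian data and second in little-endian data.
    nuls_first = raw[0::2].count(b"\x00")
    nuls_second = raw[1::2].count(b"\x00")
    if nuls_first > nuls_second:
        return "utf-16-be"
    return "utf-16-le"

# Sample bytes standing in for the contents of the file, read in "rb" mode.
raw = u"Tria".encode("utf-16-be")
text = raw.decode(guess_utf16_codec(raw))
print(text)
```

With a real file you would read the bytes in binary mode first and pass them to the helper in place of the sample bytes.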
--
Steven