[Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?
steve at pearwood.info
Sun Nov 20 22:45:42 CET 2011
dave selby wrote:
> I split the HTML and print text and I get loads of
> \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character.
> Any idea what is happening and how to get back to a list of ascii strings ?
How did you generate the HTML file? What other applications have you
used to save the document?
Something in the tool chain before it reached Python has saved it using
a wide (two bytes per character) encoding, most likely UTF-16, as that is
widely used by Windows and Java. With the right settings, it could take as
little as opening the file in Notepad, then clicking Save.
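To see where the \x00 bytes come from, here is a small sketch (written for Python 3; the sample string "Tria" is taken from your output, and the little-endian byte order is only an assumption for the demonstration):

```python
# How plain ASCII text looks once it has been stored as UTF-16.
text = "Tria"
raw = text.encode("utf-16-le")     # little-endian, no byte order mark
print(repr(raw))                   # b'T\x00r\x00i\x00a\x00'
decoded = raw.decode("utf-16-le")  # decoding recovers the original text
print(decoded == text)             # True
```

Each ASCII character takes two bytes in UTF-16, and one of the two is a NUL, which is the \x00 you see between the letters.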
If this isn't making sense to you, you should read this:
If my guess is right that the file is UTF-16, then you can "fix" it by
re-encoding it:

f = open("my_html_file.html", "rb")  # read the raw bytes
text = f.read().decode("utf-16")  # convert bytes to text
f.close()
data = text.encode("ascii")  # If this fails, try "latin-1" instead
f = open("my_html_file2.html", "wb")
f.write(data)  # write bytes back to disk
f.close()

Once you've inspected the re-written file my_html_file2.html and it is
okay to your satisfaction, you can delete the original one.
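If you would rather check my guess before converting anything, a rough heuristic along these lines may help. This is only a sketch for Python 3: the function name guess_utf16, the 512-byte sample size and the 90% threshold are all made up for illustration, and it only works for text that is mostly ASCII.

```python
import codecs

def guess_utf16(raw):
    """Return a likely UTF-16 codec name for raw bytes, or None.

    Hypothetical helper: a heuristic, not a reliable encoding detector.
    """
    # A byte order mark at the start is the strongest hint.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    sample = raw[:512]
    evens, odds = sample[0::2], sample[1::2]
    # Mostly-ASCII text in UTF-16 has a NUL in every other byte position.
    if odds and odds.count(b"\x00") > 0.9 * len(odds):
        return "utf-16-le"
    if evens and evens.count(b"\x00") > 0.9 * len(evens):
        return "utf-16-be"
    return None

print(guess_utf16("Tria".encode("utf-16-be")))  # utf-16-be
print(guess_utf16(b"<html>Tria</html>"))        # None
```

If it returns a codec name, pass that name to decode() in place of "utf-16"; if it returns None, the file is probably not UTF-16 and my guess was wrong.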