[Tutor] Opening filenames with unicode characters
mail at timgolden.me.uk
Thu Jun 28 22:17:40 CEST 2012
On 28/06/2012 20:48, James Chapman wrote:
> The name of the file I'm trying to open comes from a UTF-16 encoded
> text file, I'm then using regex to extract the string (filename) I
> need to open.
OK. Let's focus on that. For the moment -- although it might
well be very relevant -- I'm going to ignore the regex side
of things. It's always tricky trying to portray things like this
because there's such confusion between what characters I
write to represent the data and the data represented by those
characters.
OK, let's adopt a convention whereby I represent the data as
the kind of thing you'd see in a hex editor. This obviously
isn't how it appears in a text file but hopefully it'll be
clear what's going on.
I have a filename £10.txt -- that is the characters:
POUND SIGN
DIGIT ONE
DIGIT ZERO
FULL STOP
LATIN SMALL LETTER T
LATIN SMALL LETTER X
LATIN SMALL LETTER T
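If you want to check the character names for yourself, the standard unicodedata module will report them. This is only a minimal sketch; I've written the pound sign as the escape \xa3 so the snippet doesn't depend on how the source file itself is encoded:

```python
import unicodedata

filename = u"\xa310.txt"  # the same text as £10.txt

# Look up the official Unicode name of each character in turn
names = [unicodedata.name(ch) for ch in filename]
for name in names:
    print(name)
```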
I have -- prior to your getting there -- placed this in a text
file which I guarantee is UTF16-encoded. For the purposes of
illustration I shall do that in Python code here:
with open ("filedata.dat", "wb") as f:
    f.write (u"£10.txt".encode ("utf16"))
The file is named "filedata.dat" and looks like this (per our convention):
ff fe a3 00 31 00 30 00 2e 00 74 00 78 00 74 00
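If you don't have a hex editor to hand, something like the following should reproduce that listing. Treat it as a sketch under a couple of assumptions: I spell out the byte order mark and the little-endian encoding explicitly (the plain "utf16" codec uses the machine's native byte order, which is little-endian on a typical Windows box), and hexdump is just a name I've made up for the helper:

```python
import codecs

def hexdump(data):
    # Render each byte as two hex digits, separated by spaces
    return " ".join("{0:02x}".format(b) for b in bytearray(data))

# BOM (ff fe) followed by the little-endian UTF-16 bytes of the filename
encoded = codecs.BOM_UTF16_LE + u"\xa310.txt".encode("utf-16-le")
print(hexdump(encoded))
```

Each character ends up as two bytes, low byte first, which is why every second byte in the listing is 00.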
I now want to read the contents of that file as a
filename and open the file in question. Here goes:
# Open the file and extract the data as a set of
# bytes into a Python (byte) string.
with open("filedata.dat", "rb") as f:
    data = f.read()
# Convert the data into a unicode object by decoding
# the UTF16 bytes
filename = data.decode("utf16")
# filename is now a unicode object which, depending on
# what your console offers, will either display as
# £10.txt or as \xa310.txt or as something else.
# Open that file by passing the unicode object directly
# to Python's file-opening mechanism
ten_pound_txt = open (filename, "rb")
print ten_pound_txt.read () # whatever
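To try the whole round trip in one go without touching any real files, here's a self-contained sketch. It's my own arrangement of the steps above rather than anything definitive: the temporary directory, the "ten pounds" contents and the variable names are all made up for illustration, and it assumes your filesystem can store a filename containing a pound sign:

```python
import os
import shutil
import tempfile

# Work in a throwaway directory (hypothetical setup for the demo)
workdir = tempfile.mkdtemp()
try:
    target_name = u"\xa310.txt"  # the same text as £10.txt

    # Create the file we ultimately want to open
    with open(os.path.join(workdir, target_name), "wb") as f:
        f.write(b"ten pounds")

    # Store its name, UTF-16 encoded, in filedata.dat
    with open(os.path.join(workdir, "filedata.dat"), "wb") as f:
        f.write(target_name.encode("utf-16"))

    # Read the raw bytes back and decode them into a unicode object
    with open(os.path.join(workdir, "filedata.dat"), "rb") as f:
        data = f.read()
    filename = data.decode("utf-16")

    # Pass the unicode filename straight to open()
    with open(os.path.join(workdir, filename), "rb") as f:
        contents = f.read()
finally:
    # Tidy up the throwaway directory
    shutil.rmtree(workdir)

print(repr(filename))
print(repr(contents))
```

The important point is that the filename stays a unicode object from the moment it is decoded until it reaches open(); nothing re-encodes it along the way.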
I don't know if that makes anything clearer for you, but at
least it gives you something to try out.
The business with the regex clouds the issue: regex can play
a little awkwardly with Unicode, so you'd have to show some
code if you need help there.
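In the meantime, here's one way the awkwardness can show up, purely as a hedged illustration. I don't know what your data or your pattern look like, so the file="..." layout and the pattern below are entirely my invention:

```python
import re

# Hypothetical line from the UTF-16 text file, as raw little-endian bytes
raw = u'file="\xa310.txt"\n'.encode("utf-16-le")

# Searching the raw bytes with a plain byte pattern finds nothing,
# because each of these characters is followed by a 00 byte
bytes_match = re.search(b'file="(.+?)"', raw)

# Decode first, then search the resulting unicode text
text = raw.decode("utf-16-le")
text_match = re.search(u'file="(.+?)"', text)
filename = text_match.group(1)

print(bytes_match)
print(repr(filename))
```

So if your regex runs over the undecoded bytes, that alone could explain a filename which doesn't match what's on disk; decoding before matching sidesteps it.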