[Tutor] Opening filenames with unicode characters
James Chapman
james at uplinkzero.com
Thu Jun 28 21:48:19 CEST 2012
Informative, thanks Jerry. However, I'm not out of the woods yet.
> Here's a couple of questions that you'll need to answer 'Yes' to
> before you're going to get this to work reliably:
>
> Are you familiar with the differences between byte strings and unicode
> strings?
I think so, although I'm probably missing key bits of information.
> Do you understand how to convert from one to the other,
> using a particular encoding?
No, not really. This is something that's still very new to me.
> Do you know what encoding your source
> file is saved in?
The name of the file I'm trying to open comes from a UTF-16 encoded text file; I then use a regex to extract the string (the filename) I need to open. However, all the examples I've been using here are just typed into the Python console, so where the string comes from is largely irrelevant at this stage.
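For reference, a minimal sketch of that pipeline: read a UTF-16 text file, then pull a filename out with a regex. The helper name `extract_filenames`, the pattern, and the sample file contents are all made up for illustration; the real listing file and regex are not shown in this post. It uses `io.open`, which exists in both Python 2.7 and Python 3.

```python
import io
import os
import re
import tempfile


def extract_filenames(path, pattern=r'"([^"]+\.js)"'):
    """Return every quoted *.js name found in a UTF-16 text file.

    The pattern is a hypothetical example, not the regex from the post.
    """
    # io.open decodes while reading, so `text` is already a unicode string.
    with io.open(path, 'r', encoding='utf-16') as handle:
        text = handle.read()
    return re.findall(pattern, text, re.UNICODE)


# Demo with a throwaway UTF-16 file (contents invented for the example).
tmp_dir = tempfile.mkdtemp()
listing = os.path.join(tmp_dir, 'listing.txt')
with io.open(listing, 'w', encoding='utf-16') as handle:
    handle.write(u'file = "that\u0153 stuff.txt.js"\n')

names = extract_filenames(listing)
print(repr(names))
```

Because the file is decoded once at read time, the extracted names are unicode strings and do not need a second `unicode(...)` conversion.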
> If your string is not coming from a source file,
> but some other source of bytes, do you know what encoding those bytes
> are using?
>
> Try the following. Before trying to convert filename to unicode, do a
> "print repr(filename)". That will show you the byte string, along
> with the numeric codes for the non-ascii parts. Then convert those
> bytes to a unicode object using the appropriate encoding. If the
> bytes are utf-8, then you'd do something like this:
> unicode_filename = unicode(filename, 'utf-8')
>>> print(repr(filename))
"This is_a-test'FILE to Ensure$ that\x9c stuff^ works.txt.js"
>>> fileName = unicode(filename, 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 35: invalid start byte
>>> fileName = unicode(filename, 'utf-16')
>>> fileName
u'\u6854\u7369\u6920\u5f73\u2d61\u6574\u7473\u4627\u4c49\u2045\u6f74\u4520\u736e\u7275\u2465\u7420\u6168\u9c74\u7320\u7574\u6666\u205e\u6f77\u6b72\u2e73\u7874\u2e74\u736a'
So I now have a UTF-16 encoded string, but I still can't open it.
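For reference, a sketch of what that conversion does to the bytes shown by `repr(filename)`. Decoding as UTF-16 takes the bytes two at a time, so each pair of ASCII letters becomes one unrelated character. The `cp1252` codec below is an assumption (a common Windows code page in which byte 0x9c is the character U+0153); the post does not say which encoding the byte string really uses. `utf-16-le` is spelled out so the result does not depend on the machine's byte order.

```python
# The byte string from the console session above.
raw = b"This is_a-test'FILE to Ensure$ that\x9c stuff^ works.txt.js"

# UTF-16 pairs the bytes up: 'T' (0x54) and 'h' (0x68) become U+6854.
as_utf16 = raw.decode('utf-16-le')

# A single-byte code page maps each byte to exactly one character.
# cp1252 is an assumption here, not something confirmed in the post.
as_cp1252 = raw.decode('cp1252')

print(len(raw), len(as_utf16), len(as_cp1252))
print(repr(as_utf16[0]), repr(as_cp1252[35]))
```

If the single-byte guess is right, the resulting unicode string is the one to pass to `open()`; the UTF-16 result names a file that does not exist.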
>>> codecs.open(fileName, 'r', 'utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\codecs.py", line 881, in open
    file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 2] No such file or directory: u'\u6854\u7369\u6920\u5f73\u2d61\u6574\u7473\u4627\u4c49\u2045\u6f74\u4520\u736e\u7275\u2465\u7420\u6168\u9c74\u7320\u7574\u6666\u205e\u6f77\u6b72\u2e73\u7874\u2e74\u736a'
I presume I need to perform some kind of decode operation on it to open the file, but then am I not basically going back to my starting point?
Apologies if I'm missing the obvious.
--
James