[Tutor] Opening filenames with unicode characters

James Chapman james at uplinkzero.com
Thu Jun 28 21:48:19 CEST 2012


Informative, thanks Jerry. However, I'm not out of the woods yet.

 
> Here's a couple of questions that you'll need to answer 'Yes' to
> before you're going to get this to work reliably:
> 
> Are you familiar with the differences between byte strings and unicode
> strings? 

I think so, although I'm probably missing key bits of information.

> Do you understand how to convert from one to the other,
> using a particular encoding?  

No, not really. This is something that's still very new to me.

> Do you know what encoding your source
> file is saved in? 

The name of the file I'm trying to open comes from a UTF-16 encoded text file; I then use a regex to extract the string (the filename) I need to open. However, all the examples I've shown here were typed straight into the Python console, so the source of the string is largely irrelevant at this stage.

> If your string is not coming from a source file,
> but some other source of bytes, do you know what encoding those bytes
> are using?
> 
> Try the following.  Before trying to convert filename to unicode, do a
> "print repr(filename)".  That will show you the byte string, along
> with the numeric codes for the non-ascii parts.  Then convert those
> bytes to a unicode object using the appropriate encoding.  If the
> bytes are utf-8, then you'd do something like this:
> unicode_filename = unicode(filename, 'utf-8')

>>> print(repr(filename))
"This is_a-test'FILE to Ensure$ that\x9c stuff^ works.txt.js"

>>> fileName = unicode(filename, 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 35: invalid start byte
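
A note on that error: 0x9c is a continuation byte in UTF-8, so it can never start a character, which suggests the console bytes are in some single-byte code page instead. The sketch below tries a few candidates on a shortened copy of the name. The choice of cp850, cp437 and cp1252 is an assumption (they are common on Windows), not something established in this thread.

```python
# Shortened copy of the console byte string from above.
raw = b"that\x9c stuff"

# Confirm that UTF-8 rejects the lone 0x9c byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Try a few single-byte code pages (assumed candidates) and compare results.
candidates = {}
for codec in ("cp850", "cp437", "cp1252"):
    candidates[codec] = raw.decode(codec)
    print("%s -> %r" % (codec, candidates[codec]))
```

Whichever candidate shows the character that is really in the filename is probably the encoding to use when converting the byte string to unicode.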

>>> fileName = unicode(filename, 'utf-16')

>>> fileName
u'\u6854\u7369\u6920\u5f73\u2d61\u6574\u7473\u4627\u4c49\u2045\u6f74\u4520\u736e\u7275\u2465\u7420\u6168\u9c74\u7320\u7574\u6666\u205e\u6f77\u6b72\u2e73\u7874\u2e74\u736a'



So I now have a UTF-16 encoded string, but I still can't open it.
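
For what it's worth, a check along the following lines suggests that the result may not be the original name at all: decoding as UTF-16 treats every two bytes as one 16-bit code unit, so the 56 bytes collapse into 28 unrelated characters. This is only a sketch; it uses 'utf-16-le' explicitly on the assumption that the machine is little-endian, which matches the \u6854 seen above.

```python
# The byte string exactly as repr() showed it on the console.
raw = b"This is_a-test'FILE to Ensure$ that\x9c stuff^ works.txt.js"

# Decoding as UTF-16 (little-endian assumed) pairs the bytes up:
# 'T' (0x54) and 'h' (0x68) become the single code unit 0x6854.
wrong = raw.decode("utf-16-le")

print(len(raw), len(wrong))
print(repr(raw[:2]), repr(wrong[0]))

# Encoding back gives the same bytes, so nothing was converted;
# the bytes were only reinterpreted.
round_trip = wrong.encode("utf-16-le")
```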

>>> codecs.open(fileName, 'r', 'utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\codecs.py", line 881, in open
    file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 2] No such file or directory: u'\u6854\u7369\u6920\u5f73\u2d61\u6574\u7473\u4627\u4c49\u2045\u6f74\u4520\u736e\u7275\u2465\u7420\u6168\u9c74\u7320\u7574\u6666\u205e\u6f77\u6b72\u2e73\u7874\u2e74\u736a'
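
The encoding argument passed to codecs.open describes the contents of the file, not its name, so the name has to be turned into a correct unicode string separately. The sketch below illustrates the split under two assumptions that are not confirmed here: that the console bytes are cp850, and that the file lives in a scratch directory created just for the demonstration (the shortened name and the directory are hypothetical).

```python
import io
import os
import shutil
import tempfile

# Hypothetical shortened name, in the console's byte form.
raw_name = b"that\x9c stuff.txt"

# Step 1: decode the NAME with the console code page (assumed cp850).
name = raw_name.decode("cp850")

# Scratch directory so the demonstration does not touch real files.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, name)

try:
    # Step 2: the encoding given to open() applies to the CONTENTS only.
    with io.open(path, "w", encoding="utf-16") as f:
        f.write(u"hello")
    with io.open(path, "r", encoding="utf-16") as f:
        content = f.read()
finally:
    shutil.rmtree(tmp_dir)

print(repr(name), repr(content))
```

io.open is used because it behaves the same on Python 2.6+ and Python 3; codecs.open(path, 'r', 'utf-16') should accept the same unicode path.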


I presume I need to perform some kind of decode operation on it to open the file, but then am I not basically going back to my starting point?

Apologies if I'm missing the obvious.

--
James



