[Tutor] Opening filenames with unicode characters

Prasad, Ramit ramit.prasad at jpmorgan.com
Thu Jun 28 22:33:25 CEST 2012


> The name of the file I'm trying to open comes from a UTF-16 encoded text file,
> I'm then using regex to extract the string (filename) I need to open. However,
> all the examples I've been using here are just typed into the python console,
> meaning string source at this stage is largely irrelevant.
>
> >>> print(repr(filename))
> "This is_a-test'FILE to Ensure$ that\x9c stuff^ works.txt.js"
> 
> >>> fileName = unicode(filename, 'utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 35:
> invalid start byte
> 
> >>> fileName = unicode(filename, 'utf-16')
> 
> >>> fileName
> u'\u6854\u7369\u6920\u5f73\u2d61\u6574\u7473\u4627\u4c49\u2045\u6f74\u4520
> \u736e\u7275\u2465\u7420\u6168\u9c74\u7320\u7574\u6666\u205e\u6f77\u6b72
> \u2e73\u7874\u2e74\u736a'

> So I now have a UTF-16 encoded string, but I still can't open it.
> 
> >>> codecs.open(fileName, 'r', 'utf-16')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python27\lib\codecs.py", line 881, in open
>     file = __builtin__.open(filename, mode, buffering)
> IOError: [Errno 2] No such file or directory:
> u'\u6854\u7369\u6920\u5f73\u2d61\u6574\u7473\u4627\u4c49\u2045\u6f74\u4520
> \u736e\u7275\u2465\u7420\u6168\u9c74\u7320\u7574\u6666\u205e\u6f77\u6b72
> \u2e73\u7874\u2e74\u736a'
>

What happens if you use the filename exactly as given above, without
converting it?
("This is_a-test'FILE to Ensure$ that\x9c stuff^ works.txt.js")
That works for me. If it works for you too, then just use
codecs.open(filename).
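
To illustrate why the unicode(filename, 'utf-16') attempt produced a name
that does not exist on disk, here is a minimal sketch. It uses the byte
string from your message; the cp1252 guess at the end is only my assumption
about where the \x9c byte came from, not something your message confirms.

```python
# The byte string from the original message, written as a bytes literal
# so the sketch runs on both Python 2.6+ and Python 3.
raw = b"This is_a-test'FILE to Ensure$ that\x9c stuff^ works.txt.js"

# Decoding as UTF-16 fuses every two bytes into one code unit,
# so the decoded text is half as long as the byte string.
wrong = raw.decode('utf-16-le')
raw_length = len(raw)
wrong_length = len(wrong)

# 'T' (0x54) and 'h' (0x68) become the single character U+6854,
# the first character in the garbled result shown in the traceback.
first_code_point = ord(wrong[0])

# ASSUMPTION: the bytes are really Windows-1252 text, where the
# byte 0x9c stands for U+0153. This is a guess, not a known fact.
guess = raw.decode('cp1252')
suspect_char = guess[35]
```

The point is only that the bytes are not UTF-16 to begin with, so decoding
them that way gives a different string, and hence a different filename.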

If you also read your UTF-16 source file with codecs.open() and pass the
encoding, then I think you would not need to worry about any conversion:
the text you read comes back already decoded as unicode.
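
A rough sketch of that idea follows. The listing file, its "file: " line
format, and the regex are all hypothetical stand-ins for your real data,
and I use io.open() here (available on Python 2.6+ and 3) in place of
codecs.open(); both take an encoding and hand back decoded text.

```python
import io
import os
import re
import tempfile

# Hypothetical UTF-16 source file, created here so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
listing_path = os.path.join(tmp_dir, 'listing.txt')
with io.open(listing_path, 'w', encoding='utf-16') as f:
    f.write(u'file: This is_a-test\u0153 works.txt.js\n')

# Opening with the encoding returns unicode text; the BOM is consumed on read.
with io.open(listing_path, 'r', encoding='utf-16') as f:
    text = f.read()

# Hypothetical pattern: everything after "file: " up to the end of the line.
match = re.search(u'file: (.+)', text)
filename = match.group(1)

# The extracted name can be used directly, with no further decode step.
target = os.path.join(tmp_dir, filename)
with io.open(target, 'w', encoding='utf-8') as f:
    f.write(u'ok')
with io.open(target, 'r', encoding='utf-8') as f:
    content = f.read()
```

Whether this matches your setup depends on how the filenames are really
stored in the source file, so treat it as a starting point only.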

Note that a round trip through UTF-16 adds two bytes at the front:
>>> "This is_a-test'FILE to Ensure$ that\x9c stuff^ works.txt.js".decode('utf16').encode('utf16')
"\xff\xfeThis is_a-test'FILE to Ensure$ that\x9c stuff^ works.txt.js"
The leading '\xff\xfe' is the UTF-16 byte order mark (BOM), which the
'utf16' codec prepends when encoding. You can strip off the first two
bytes, or encode with 'utf-16-le', which does not write a BOM.
>>> print '\xff\xfe'
ÿþ
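
Here is a small sketch of where those two leading bytes come from. The
sample text is just a placeholder; the codec names and the BOM constants
are from the standard codecs module.

```python
import codecs

# Placeholder text; any unicode string behaves the same way.
text = u'works.txt.js'

# The generic 'utf-16' codec uses native byte order and prepends a BOM.
with_bom = text.encode('utf-16')

# Naming the byte order explicitly suppresses the BOM.
without_bom = text.encode('utf-16-le')

# The first two bytes are one of the two BOM constants, depending on platform.
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding with 'utf-16' consumes the BOM again, so the round trip is clean.
round_trip = with_bom.decode('utf-16')
```

So the extra bytes only show up in the encoded form, and they disappear
again when the data is decoded with the same codec.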

Ramit

