[Tutor] Unicode? UTF-8? UTF-16? WTF-8? ;)
eryksun at gmail.com
Wed Sep 5 16:31:16 CEST 2012
On Wed, Sep 5, 2012 at 5:42 AM, Ray Jones <crawlzone at gmail.com> wrote:
> I have directory names that contain Russian characters, Romanian
> characters, French characters, et al. When I search for a file using
> glob.glob(), I end up with stuff like \x93\x8c\xd1 in place of the
> directory names. I thought simply identifying them as Unicode would
> clear that up. Nope. Now I have stuff like \u0456\u0439\u043e.
This is just an FYI in case you were manually decoding. Since glob
calls os.listdir(dirname), you can get Unicode output if you call it
with a Unicode arg:
>>> t = u"\u0456\u0439\u043e"
>>> open(t, 'w').close()
>>> import glob
>>> glob.glob('*') # UTF-8 output
>>> glob.glob(u'*') # Unicode output
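The same comparison can be replayed as a script. This is only a sketch, assuming a POSIX system whose file system encoding is UTF-8. The helper name glob_both_ways is made up for illustration, and the b"" and u"" prefixes are there so it runs on 2.7 as well as 3.3+:

```python
import glob
import os
import shutil
import tempfile


def glob_both_ways(name=u"\u0456\u0439\u043e"):
    """Create one file with a non-ASCII name in a scratch directory, then
    glob it with a byte-string pattern and with a Unicode pattern.

    Returns (byte_names, text_names). Hypothetical helper, not part of glob.
    """
    old_cwd = os.getcwd()
    tmp = tempfile.mkdtemp()
    try:
        os.chdir(tmp)
        # Assumption: UTF-8 file names, so create the file via its UTF-8 bytes.
        open(name.encode("utf-8"), "w").close()
        byte_names = glob.glob(b"*")  # byte pattern in, byte strings out
        text_names = glob.glob(u"*")  # Unicode pattern in, Unicode out
    finally:
        # Restore the working directory and clean up the scratch directory.
        os.chdir(old_cwd)
        shutil.rmtree(tmp)
    return byte_names, text_names
```

Calling glob_both_ways() returns two one-element lists: the first holds the raw encoded file name, the second the decoded name. The type of the pattern decides which one you get, so no manual decoding is needed.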
Regarding subprocess.Popen, just use Unicode -- at least on a POSIX
system. Popen calls an exec function, such as posix.execv, which
handles encoding Unicode arguments to the file system encoding.
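A quick way to check this is to hand a Unicode argument to a child process and look at the bytes it receives. The sketch below assumes a POSIX system with an external echo command on the PATH. The helper name echo_via_child is made up for illustration:

```python
import subprocess
import sys


def echo_via_child(text):
    """Pass a Unicode argument to `echo` and return the raw bytes it prints.

    Hypothetical helper; assumes POSIX and an `echo` executable on the PATH.
    """
    # No manual .encode() here: the argument list holds a Unicode string,
    # and the encoding to bytes happens on the way to the exec call.
    output = subprocess.check_output(["echo", text])
    return output.rstrip(b"\n")  # drop the newline that echo appends


def expected_bytes(text):
    """The bytes the child should see if the file system encoding is used."""
    return text.encode(sys.getfilesystemencoding())
```

With a UTF-8 locale, echo_via_child(u"\u0456\u0439\u043e") should come back equal to expected_bytes(u"\u0456\u0439\u043e"), i.e. the UTF-8 encoding of the name.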
On Windows, the _subprocess C extension in 2.x is limited to calling
CreateProcessA with char* 8-bit strings. So Unicode characters beyond
ASCII (the default encoding) trigger an encoding error.
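The failure is the ordinary implicit ASCII conversion, so it can be reproduced by hand on any platform. The sketch below also shows one possible workaround, which is my assumption rather than anything subprocess provides: encode the arguments yourself before calling Popen. Both helper names are made up, and the code page to use (for example 'mbcs' on Windows) depends on your system:

```python
def fails_with_ascii(text):
    """Return True if `text` cannot be encoded with the ASCII codec,
    which is the conversion that breaks for non-ASCII arguments."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


def encode_args(args, encoding):
    """Hypothetical helper: explicitly encode Unicode arguments to byte
    strings, leaving arguments that are already byte strings untouched."""
    text_type = type(u"")
    return [a.encode(encoding) if isinstance(a, text_type) else a
            for a in args]
```

For example, fails_with_ascii(u"\u0456\u0439\u043e") is True, while encode_args([u"\u0456\u0439\u043e"], "cp1251") gives a byte string that an 8-bit API can accept, provided the code page can represent the characters.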