Unicode File Names

Jordan jordan.taylor2 at gmail.com
Thu Oct 16 23:56:19 EDT 2008


On Oct 16, 10:18 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Oct 17, 12:52 pm, Jordan <jordan.tayl... at gmail.com> wrote:
>
>
>
> > On Oct 16, 9:20 pm, John Machin <sjmac... at lexicon.net> wrote:
>
> > > On Oct 17, 11:43 am, Jordan <jordan.tayl... at gmail.com> wrote:
>
> > > > I've got a bunch of files with Japanese characters in their names and
> > > > os.listdir() replaces those characters with ?'s. I'm trying to open
> > > > the files several steps later, and obviously Python isn't going to
> > > > find '01-????.jpg' (formerly '01-ひらがな.jpg') because it doesn't exist.
> > > > I'm not sure where in the process I'm able to stop that from
> > > > happening. Thanks.
>
> > > The Fine Manual says:
> > > """
> > > listdir( path)
>
> > > Return a list containing the names of the entries in the directory.
> > > The list is in arbitrary order. It does not include the special
> > > entries '.' and '..' even if they are present in the directory.
> > > Availability: Macintosh, Unix, Windows.
> > > Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a
> > > Unicode object, the result will be a list of Unicode objects.
> > > """
>
> > > Are you unsure whether your version of Python is 2.3 or later?
>
> > *** Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32
> > bit (Intel)] on win32. *** says my interpreter
>
> > when it says "if path is a Unicode object...", does that mean the path
> > name must have a Unicode char?
>
> If path is a Unicode [should read unicode] object of length > 0, then
> *all* characters in path are by definition unicode characters.
>
> Where are you getting your path from? If you are doing
> os.listdir(r'c:\test') then do os.listdir(ur'c:\test'). If you are
> getting it from the command line or somehow else as a variable,
> instead of os.listdir(path), try os.listdir(unicode(path)). If that
> fails with a message like "UnicodeDecodeError: 'ascii' codec can't
> decode .....", then you'll need something like
> os.listdir(unicode(path, encoding='cp1252'))
> # cp1252 being the most likely suspect :)
>
> I strongly suggest that you read this:
>    http://www.amk.ca/python/howto/unicode
> which contains lots of useful information, including an answer to your
> original question.

Thanks go to Chris and John for starting me off in the right
direction.

I'm not quite sure now if the problem is me, Windows, or zipfile
(which I failed to mention before). Using
os.listdir(unicode(os.getcwd())) seems to have been a step in the
right direction. When testing things in the Python interpreter, I
don't hit any issues after using that line.

[code]
>>> import os
>>> l = os.listdir(unicode(os.getcwd()))
>>> l
[u'01-\u3072\u3089\u304c\u306a.jpg', u'02-\u3072\u3089\u304c\u306a.jpg', u'03-\u3072\u3089\u304c\u306a.jpg']
>>> for thing in l:
...     print thing
...
01-ひらがな.jpg
02-ひらがな.jpg
03-ひらがな.jpg
[/code]
Yay.

Putting the same loop ("for thing in l: print thing") in a script file
and running it from the command prompt fails with:

  File "C:\Python25\Lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in
position 13-16: character maps to <undefined>
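
For debugging output only, one possible workaround is to encode with
errors='replace' before printing, so unencodable characters come out as
'?' instead of raising. This is just a sketch: safe_for_console is a
made-up helper name, and 'cp437' is taken from the traceback above.

```python
# Sketch only: safe_for_console is a hypothetical helper, and 'cp437' is
# assumed to be the console code page (it is the codec in the traceback).

def safe_for_console(text, encoding='cp437'):
    """Return text with characters the console encoding lacks replaced by '?'."""
    # errors='replace' substitutes '?' instead of raising UnicodeEncodeError
    return text.encode(encoding, 'replace').decode(encoding)

names = [u'01-\u3072\u3089\u304c\u306a.jpg', u'02-\u3072\u3089\u304c\u306a.jpg']
for name in names:
    # Only the printed copy is altered; the original name is left unchanged
    print(safe_for_console(name))
```

The names in the list stay as real unicode objects, so they can still be
used to open the files; only what gets printed is degraded.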

I'm perfectly willing to let the command prompt refuse to print that
(it's debugging only) if the next issue were resolved >_>. From the
zipfile documentation:

"""
Note: There is no official file name encoding for ZIP files. If you
have unicode file names, please convert them to byte strings in your
desired encoding before passing them to write(). WinZip interprets all
file names as encoded in CP437, also known as DOS Latin.
"""

I'm simply not sure what this means or how to deal with it.
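
One reading of that note, as a sketch rather than a confirmed fix: the
archive stores names as bytes, so a unicode name has to be turned into a
byte string with some explicitly chosen encoding first. The helper name
encode_zip_name is made up, and UTF-8 is only an assumed choice of
"desired encoding"; the docs don't prescribe one.

```python
# Sketch only: encode_zip_name is a hypothetical helper, and UTF-8 is an
# assumed choice of encoding, not something the zipfile docs prescribe.

def encode_zip_name(name, encoding='utf-8'):
    """Encode a unicode file name to a byte string for use as an archive name."""
    return name.encode(encoding)

name = u'01-\u3072\u3089\u304c\u306a.jpg'

# CP437 has no hiragana, so a strict encode to it raises UnicodeEncodeError.
try:
    encode_zip_name(name, 'cp437')
    fits_cp437 = True
except UnicodeEncodeError:
    fits_cp437 = False

# UTF-8 can represent every character, and decoding restores the name.
stored = encode_zip_name(name)
restored = stored.decode('utf-8')

print(fits_cp437)        # False
print(restored == name)  # True
```

On Python 2.5 the byte string in stored would presumably be what goes in
as the arcname argument of ZipFile.write(), and whatever reads the
archive back has to decode with the same encoding. Going by the note,
WinZip would read those bytes as CP437, so the Japanese part of the name
would likely show up garbled there.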


