os.walk the apostrophe and unicode
Rod Person
rodperson at rodperson.com
Sat Jun 24 15:47:25 EDT 2017
On Sat, 24 Jun 2017 13:28:55 -0600
Michael Torrie <torriem at gmail.com> wrote:
> On 06/24/2017 12:57 PM, Rod Person wrote:
> > Hi,
> >
> > I'm working on a program that will walk a file system and clean the
> > id3 tags of mp3 and flac files. Everything works great until
> > the following file is found:
> >
> > '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
> >
> > for some reason that I can't understand os.walk() returns this file
> > name as
> >
> > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>
> That's basically a UTF-8 string there:
>
> $ python3
> >>> a = b'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
> >>> print (a.decode('utf-8'))
> 06 - Todd’s Song (Post-Spiderland Song in Progress).flac
> >>>
>
> The NAS is just happily reading the UTF-8 bytes and passing them on
> the wire.
>
> > which then causes more hell than a little bit for me. I don't
> > understand why the apostrophe (') becomes \xe2\x80\x99, or what I can
> > do about it.
>
> It's clearly not an ASCII apostrophe in the original filename, but
> probably U+2019 (’), RIGHT SINGLE QUOTATION MARK, which UTF-8 encodes
> as the three bytes E2 80 99.
>
> > The script is Python 3, and the file system it is running on is a
> > HAMMER filesystem on DragonFlyBSD. The audio files reside on a QNAP
> > NAS which runs some kind of Linux, so it is probably ext3/4. The
> > files came from various systems (Mac, Windows, FreeBSD).
>
> It's the file serving protocol that dictates how filenames are
> transmitted. In your case it's probably smb. smb (samba) is just
> passing the native bytes along from the file system. Since you know
> the native file system is just UTF-8, you can just decode every
> filename from utf-8 bytes into unicode.
This is the impression that I was under. My unicode isn't that strong,
so maybe my understanding is off, but I tried:

    file_name = file_name.decode('utf-8', 'ignore')

but when I get to my logging code:

    logfile.write(file_name)

that throws the error:

    UnicodeEncodeError: 'ascii' codec can't encode characters in position 39-41: ordinal not in range(128)
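
A minimal sketch of what may be going on, under some assumptions: the
decode step itself is fine, and the error comes from the log file
object, which was opened with an ASCII default encoding (likely taken
from the locale). The helper to_text() and the in-memory streams are
hypothetical stand-ins for the real script and log file. The str branch
assumes names may arrive with surrogate escapes, which has not been
confirmed for this setup.

```python
import io

def to_text(name):
    """Return a filename as str, whether os.walk() gave bytes or str."""
    if isinstance(name, bytes):
        # Bytes come from walking a bytes path, e.g. os.walk(b'/some/dir').
        return name.decode('utf-8', 'replace')
    # Assumption: a str may carry surrogate escapes if the locale could not
    # decode the original bytes; round-trip through UTF-8 to recover them.
    return name.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')

raw = b'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
file_name = to_text(raw)

# A text stream that encodes as ASCII raises the same kind of error.
ascii_log = io.TextIOWrapper(io.BytesIO(), encoding='ascii')
try:
    ascii_log.write(file_name)
    ascii_log.flush()
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The same write succeeds once the stream is told to use UTF-8.
buffer = io.BytesIO()
utf8_log = io.TextIOWrapper(buffer, encoding='utf-8')
utf8_log.write(file_name + '\n')
utf8_log.flush()
logged_bytes = buffer.getvalue()
```

In the real script the equivalent change would be to open the log with
an explicit encoding, for example open('clean.log', 'a',
encoding='utf-8'), where the file name is only a placeholder.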
--
Rod
http://www.rodperson.com
Who at Clitorius fountain thirst remove
Loath Wine and, abstinent, meer Water love.
- Ovid