Changing filenames from Greeklish => Greek (subprocess complain)
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Wed Jun 5 01:56:36 EDT 2013
On Tue, 04 Jun 2013 10:23:33 -0700, Νικόλαος Κούρας wrote:
> What on eart is this damn error: Michael tried to explain to me about
> surrogates but dont think i understand it.
>
> Encoding giving me trouble years now.
>
> [Tue Jun 04 20:19:53 2013] [error] [client 46.12.95.59] Original
> exception was: [Tue Jun 04 20:19:53 2013] [error] [client 46.12.95.59]
> Traceback (most recent call last): [Tue Jun 04 20:19:53 2013] [error]
> [client 46.12.95.59] File "files.py", line 72, in <module> [Tue Jun 04
> 20:19:53 2013] [error] [client 46.12.95.59] cur.execute('''SELECT
> url FROM files WHERE url = %s''', (fullpath,) ) [Tue Jun 04 20:19:53
> 2013] [error] [client 46.12.95.59] File
> "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py",
> line 108, in execute [Tue Jun 04 20:19:53 2013] [error] [client
> 46.12.95.59] query = query.encode(charset) [Tue Jun 04 20:19:53
> 2013] [error] [client 46.12.95.59] UnicodeEncodeError: 'utf-8' codec
> can't encode character '\udcd3' in position 61: surrogates not allowed
>
>
>
> PLEASE TELL EM WHAT TO TRY, PLEASE FOR THE LOVE OF GOD, IAM SO
> FRUSTRATED NOT BEING ABLE TO DEAL WITH THIS.
Calm down. I know it is frustrating.
On a Linux system, the file system stores bytes, and only bytes. The file
system does no validation of the bytes you give, except to check that
there are no 0x00 and 0x2f bytes (ASCII '\0' and '/') in the file name.
That's all.
So, if one program thinks that it should be sending file names in, say,
UTF-16 or ISO-8859-7 encoding, it will take a string like "Νικόλαος"
and the file system will see bytes like these:
py> s = 'Νικόλαος'
py> s.encode('UTF-16be')
b'\x03\x9d\x03\xb9\x03\xba\x03\xcc\x03\xbb\x03\xb1\x03\xbf\x03\xc2'
py> s.encode('iso-8859-7')
b'\xcd\xe9\xea\xfc\xeb\xe1\xef\xf2'
Notice that the same string gives you completely different bytes. And
likewise, the same bytes will give you different strings, depending on
the encoding you use.
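To see the second half of that claim, here is a small sketch that takes
the ISO-8859-7 bytes from above and decodes them twice, once with the
right codec and once with Latin-1. The variable names are only for
illustration.

```python
# The ISO-8859-7 bytes for 'Νικόλαος', as shown in the session above.
raw = b'\xcd\xe9\xea\xfc\xeb\xe1\xef\xf2'

# Same bytes, two codecs, two different strings.
as_greek = raw.decode('iso-8859-7')   # the intended text
as_latin1 = raw.decode('latin-1')     # wrong codec: mojibake, but no error

print(as_greek)
print(as_latin1)

# Nothing was lost: encoding with the same wrong codec gives the bytes back.
assert as_latin1.encode('latin-1') == raw
```

Latin-1 never raises an error because every byte maps to some character,
which is exactly why the wrong result can go unnoticed.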
Now, if you try to read the file name using a program that expects UTF-8,
it will either see some sort of mojibake garbage characters, or get some
sort of error:
py> s.encode('UTF-16be').decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 1: invalid start byte
py> s.encode('iso-8859-7').decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 0: invalid continuation byte
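This is also, I believe, where the '\udcd3' in your traceback comes from.
Python 3 on Linux decodes file names with the 'surrogateescape' error
handler (PEP 383), so a byte that is not valid UTF-8 does not raise an
error when the file name is read. Instead it is smuggled into the string
as a lone surrogate, and the error only appears later, when something
(here, pymysql) tries to encode the string as strict UTF-8. A minimal
sketch, using a made-up file name; the byte 0xd3 is 'Σ' in ISO-8859-7:

```python
# Hypothetical file name bytes: 0xd3 is 'Σ' in ISO-8859-7,
# but on its own it is not valid UTF-8.
name_bytes = b'\xd3.mp3'

# Decode with the surrogateescape handler, as is done for file names.
name = name_bytes.decode('utf-8', 'surrogateescape')
print(ascii(name))

# Strict UTF-8 encoding refuses the lone surrogate.
try:
    name.encode('utf-8')
    reason = None
except UnicodeEncodeError as err:
    reason = err.reason
print(reason)

# The original bytes can still be recovered with the same handler.
assert name.encode('utf-8', 'surrogateescape') == name_bytes
```

So the surrogate is not a real character in the file name. It is a marker
for a byte that could not be decoded.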
Somehow (I don't know how, because I didn't see it happen) you have one
or more files in that directory whose names, as bytes, are not valid
UTF-8, even though your system is set to use UTF-8. To fix this you need
to rename those files using a tool that doesn't care quite so much about
encodings. Use the bash command line to rename each file in turn until
the problem goes away.
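If there are many such files, the renaming could also be scripted. What
follows is only a sketch under assumptions I can't verify from here: that
the bad names were written as ISO-8859-7, and that nothing else in the
directory needs special care. The helper names (is_valid_utf8, fixed_name,
rename_bad_names) are my own, not from any library. It passes bytes to the
os functions, so Python never has to decode the file names.

```python
import os

def is_valid_utf8(name_bytes):
    """Return True if the raw file name decodes cleanly as UTF-8."""
    try:
        name_bytes.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

def fixed_name(name_bytes, source_encoding='iso-8859-7'):
    """Re-encode a file name from the ASSUMED source encoding to UTF-8.

    Raises UnicodeDecodeError if a byte has no meaning in that encoding,
    which is a hint that the guess about the encoding is wrong.
    """
    return name_bytes.decode(source_encoding).encode('utf-8')

def rename_bad_names(directory, source_encoding='iso-8859-7', dry_run=True):
    """Find files whose names are not valid UTF-8 and rename them.

    Returns a list of (old, new) byte-string pairs. With dry_run=True
    (the default) nothing on disk is changed.
    """
    directory = os.fsencode(directory)
    changes = []
    for old in os.listdir(directory):  # bytes in, bytes out
        if is_valid_utf8(old):
            continue
        new = fixed_name(old, source_encoding)
        changes.append((old, new))
        if not dry_run:
            os.rename(os.path.join(directory, old),
                      os.path.join(directory, new))
    return changes
```

Run it with dry_run=True first and read the (old, new) pairs before
letting it touch anything. If the new names don't look like Greek, the
assumed encoding is wrong.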
--
Steven