how to handle surrogate encoding: read from fs write to database

Steven D'Aprano steve at pearwood.info
Sun Jun 12 12:50:24 EDT 2016


On Sun, 12 Jun 2016 10:09 pm, Peter Volkov wrote:

> Hi, everybody.
> 
> What is a best practice to deal with filenames in python3? The problem is
> that os.walk(src_dir), os.listdir(src_dir), ... return "surrogate" strings
> as filenames.

Can you give an example?



> It is impossible to assume that they are normal strings that 
> could be print()'ed on unicode terminal or saved as a string into
> database (mongodb) as they'll issue UnicodeEncodeError on surrogate
> character. So, how to handle this situation?

Use a better OS :-)

I believe that Mac OS X handles this the right way. If I understand
correctly, its preferred file system, HFS+, will only allow valid UTF-8
strings as file names. So you cannot get invalid Unicode strings containing
surrogates on Apple systems (unless you read from a non-HFS+ disk).

I think Windows also gets it almost right: NTFS uses UTF-16, and (I think)
only allows valid Unicode file names.

It's only Unix file systems, including Linux, that allow arbitrary bytes
(except for / and \0) in file names, so file names can be invalid Unicode,
including surrogates.

All the terminals I know of on Linux will print "bad" file names. They will
be ugly, with control characters inside them, or invisible characters that
cannot even be seen, but they will print.

> The first solution I found was to convert filenames to bytes and use them.

I think that's the only real solution. The file names on disk actually are
bytes, and they're invalid Unicode, so it shouldn't surprise you if the
only way to deal with them losslessly is to keep them as bytes.
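
To make the bytes/str relationship concrete, here is a minimal sketch of the
round trip, assuming a UTF-8 file system encoding. The file name is a
made-up example; in real code os.fsdecode() and os.fsencode() do the same
job using the actual file system encoding, and passing a bytes path such as
os.listdir(b'.') returns bytes names directly.

```python
# Hypothetical file name containing a byte (0xFF) that is not valid UTF-8.
raw_name = b'report-\xff.txt'

# Decoding with the surrogateescape handler (PEP 383) turns the
# undecodable byte 0xFF into the lone surrogate U+DCFF.
as_text = raw_name.decode('utf-8', errors='surrogateescape')

# Encoding with the same handler restores the original bytes exactly,
# which is why the str form is still usable for opening the file.
round_tripped = as_text.encode('utf-8', errors='surrogateescape')

print(ascii(as_text))
print(round_tripped == raw_name)
```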

Another way is to simply refuse to process those files. Filenames with
broken Unicode are, arguably, broken and shouldn't be allowed. It's your
application, you can specify how files have to be named. Even if your
operating system allows it, your application can refuse to deal with them:
either just skip those files, or raise an error, or insist that the user
renames them to something usable. Perhaps you can even repair the file name
yourself, by deleting or replacing the surrogates.
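
A rough sketch of the "skip or repair" policy might look like the following.
The function names is_clean() and repair() and the '?' placeholder are my
own choices for illustration, not anything standard.

```python
def is_clean(name):
    """Return True if the file name can be encoded as strict UTF-8."""
    try:
        name.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True


def repair(name, replacement='?'):
    """Replace every surrogate code point with a placeholder.

    Pass replacement='' to delete the surrogates instead.
    """
    return ''.join(
        replacement if '\ud800' <= c <= '\udfff' else c
        for c in name
    )


# Hypothetical directory listing with one broken name.
names = ['ok.txt', 'bad-\udcff.txt']

usable = [n for n in names if is_clean(n)]   # skip the broken ones
repaired = [repair(n) for n in names]        # or repair them
```

Keep in mind that a repaired name no longer matches the file on disk, so you
would have to rename the file (or remember the original name) if you still
need to open it.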


> But that's not nice. Once I need to compare filename with some string I'll
> have to convert strings to bytes.

That's not hard:

    'my filename.txt'.encode('utf-8')  # or whatever encoding your file system uses
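
If you would rather not hard-code the encoding, os.fsencode() uses the file
system encoding Python detected at startup. A small sketch, where
same_name() is a hypothetical helper name:

```python
import os


def same_name(raw_name, text_name):
    """Compare a bytes file name from the OS with a str literal.

    os.fsencode() converts the str using the file system encoding
    (with the surrogateescape handler on POSIX systems).
    """
    return raw_name == os.fsencode(text_name)


print(same_name(b'my filename.txt', 'my filename.txt'))
print(same_name(b'my filename\xff.txt', 'my filename.txt'))
```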


> Also Bytes() objects are base64 encoded 
> in mongo shell and thus they are hard to read, 

Use a better database :-)


> *e.g. "binary" : 
> BinData(0,"c29tZSBiaW5hcnkgdGV4dA==")*. Finally PEP 383 states that using
> bytes does not work in windows (btw, why?).

Windows file systems, at least NTFS, use UTF-16 file names. But that
shouldn't matter to you: all that means is that when you read file names
from Windows, you should never see any surrogates. (I think. I don't have
access to a Windows machine I can test this on.)


> Another option I found is to work with filenames as surrogate strings but
> enc them to 'latin-1' before printing/saving into database:
>     filename.encode(fse, errors='surrogateescape').decode('latin-1')

Do you want mojibake? Because that's how you get mojibake.

py> 'µPy'.encode('utf-8').decode('latin1')
'ÂµPy'
py> 'αω'.encode('utf-8').decode('latin1')
'Î±Ï\x89'

You are mapping the full range of 1114112 distinct Unicode code points into
just 256 Latin-1 characters. Bad Things happen.
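
If all you want is something readable for a log or a database shell, one
possible alternative (a sketch only, assuming UTF-8 file names and Python
3.5 or later, where 'backslashreplace' also works for decoding) is to leave
the valid text alone and show just the bad bytes as escapes. display_name()
is a made-up helper name.

```python
def display_name(name, encoding='utf-8'):
    """Return a printable form of a file name that may contain
    surrogate escapes.

    Valid characters are kept as they are; each undecodable byte is
    shown as a \\xNN escape. This is for display only: the result
    cannot be reliably converted back to the real file name.
    """
    raw = name.encode(encoding, errors='surrogateescape')
    return raw.decode(encoding, errors='backslashreplace')


print(display_name('αω-\udcff'))
```

Unlike the Latin-1 trick, the Greek letters survive intact and only the
broken byte is shown in escaped form.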

> This way I like more since latin symbols are clearly visible in mongo
> shell. Yet I doubt this is best solution.

It certainly isn't.


> Ideally I would like to send surrogate strings to database or to terminal
> as is and let db/terminal handle them. IOW let terminal print garbage
> where surrogate letters appear. Is this possible in python?

That has nothing to do with Python; it depends on the database and the
terminal.

> So what do you think: is using unicode strings and explicit conversion to
> latin-1 a good option?

Absolutely not.



> Also related question: is it possible to detect surrogate symbols in
> strings?

any('\uD800' <= c <= '\uDFFF' for c in the_string)
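
Wrapped up as functions, that check might look like the sketch below. The
function names are hypothetical. The second function is narrower: PEP 383's
surrogateescape handler only ever produces code points in the range U+DC80
to U+DCFF, one for each undecodable byte 0x80 to 0xFF.

```python
def has_surrogates(s):
    """True if the string contains any surrogate code point."""
    return any('\ud800' <= c <= '\udfff' for c in s)


def has_escaped_bytes(s):
    """True if the string contains a surrogate that the
    surrogateescape handler could have produced (U+DC80..U+DCFF)."""
    return any('\udc80' <= c <= '\udcff' for c in s)


print(has_surrogates('plain.txt'))
print(has_surrogates('bad-\udcff.txt'))
print(has_escaped_bytes('odd-\ud800.txt'))
```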



-- 
Steven


