[issue1552880] Unicode Imports
Kristján Valur Jónsson
report at bugs.python.org
Thu Sep 2 03:37:13 CEST 2010
Kristján Valur Jónsson <kristjan at ccpgames.com> added the comment:
> Yes, but in Python, U+DC80..U+DCFF range is used to store undecodable bytes.
> Eg. 'abc\xff'.decode('ascii', 'surrogateescape') gives 'abc\udcff'.
That's an inventive way of breaking the unicode standard :)
Anyway, why would you worry about that? My patch doesn't use "surrogateescape" so there is no problem. There are only two places where I "decode":
1) module names and sys.path components in the system file encoding: If they contain undecodable characters, then that is an error. No reason to propagate that error into the import machinery.
2) when decoding utf-8 back into unicode, but that utf-8 is already legal since _we_ generated it.
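
A rough sketch of those two decode points follows. This is not the patch itself, which lives in the import machinery of Python 2. It is a Python 3 illustration, and the helper name `to_utf8_component` and its `fs_encoding` parameter are made up for the example.

```python
def to_utf8_component(component, fs_encoding="utf-8"):
    """Hypothetical helper: normalise one path component to UTF-8 bytes."""
    if isinstance(component, bytes):
        # Point 1: strict decode in the file system encoding.
        # Undecodable bytes raise UnicodeDecodeError right here, so the
        # error never travels further into the import machinery.
        text = component.decode(fs_encoding, "strict")
    else:
        text = component
    # Strict encode: in Python 3 a lone surrogate raises UnicodeEncodeError.
    utf8 = text.encode("utf-8", "strict")
    # Point 2: decoding our own output back cannot fail.
    assert utf8.decode("utf-8") == text
    return utf8


# A Latin-1 encoded name is decoded once, then carried around as UTF-8.
converted = to_utf8_component("caf\u00e9".encode("latin-1"), "latin-1")
```

The only place an error can surface is the first strict decode (or the strict encode of an already-invalid unicode input), which is the behaviour argued for above.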
If a _unicode_ input (sys.path) contains a valid surrogate pair, then the utf-8 encoder just encodes it.
But if it finds a lone surrogate as you describe (Python special) then that represents an undecodable character, something that should have been covered earlier and something we know nothing about. Clearly, that makes that particular unicode sys.path component invalid.
(Hm, I notice that 2.7 happily encodes lone surrogates to utf-8)
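
For reference, the surrogate behaviour under discussion can be checked with a few lines. This sketch assumes Python 3 semantics, where a non-BMP character is a single code point rather than a stored surrogate pair, and where the strict utf-8 encoder rejects lone surrogates (unlike the 2.7 behaviour noted above).

```python
# Undecodable byte 0xFF is smuggled through as the lone surrogate U+DCFF.
escaped = b"abc\xff".decode("ascii", "surrogateescape")

# A character outside the BMP (a surrogate pair in UTF-16) encodes normally.
pair_utf8 = "\U00010000".encode("utf-8")

# The strict utf-8 encoder refuses the lone surrogate.
try:
    escaped.encode("utf-8")
    lone_rejected = False
except UnicodeEncodeError:
    lone_rejected = True

# Only the matching error handler turns it back into the original byte.
roundtrip = escaped.encode("ascii", "surrogateescape")
```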
> Python 2.7 is out and I think it is too late to fix Python2. Anyway, Python2
> uses bytes for sys.path or other paths, so the problem only occurs if the user
> specifies unicode paths.
Which is precisely the case that my patch is designed to solve. When a Chinese user installs EVE Online in a weird folder, then that should work.
Also, 2.x is not quite dead yet. There are quite a few people doing their own patches for their private purposes. Although my patch won't go into any official version, there might be others in the same situation as us: trying to support an _embedded_ Python 2.x version in an internationalized environment (on Windows :)