[issue17110] sys.argv docs should explaining how to handle encoding issues
New submission from Nick Coghlan:

The sys.argv docs [1] currently contain no mention of the fact that the arguments are Unicode strings decoded from bytes provided by the OS.

They also don't explain how to correct a decoding error by reversing Python's implicit conversion and redoing it based on the application's knowledge of the correct encoding, as described at [2].

[1] http://docs.python.org/3/library/sys#sys.argv
[2] http://stackoverflow.com/questions/6981594/sys-argv-as-bytes-in-python-3k/

----------
assignee: docs@python
components: Documentation, Unicode
messages: 181239
nosy: docs@python, ezio.melotti, ncoghlan
priority: normal
severity: normal
stage: needs patch
status: open
title: sys.argv docs should explaining how to handle encoding issues
type: enhancement
versions: Python 3.2, Python 3.3, Python 3.4

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue17110>
_______________________________________
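A minimal sketch of the correction described in the report, assuming a POSIX system where Python decoded the argument with the filesystem encoding and the surrogateescape error handler. The helper name `redecode_arg` and the choice of Latin-1 are illustrative assumptions, not something taken from the docs or the linked answer.

```python
import os


def redecode_arg(arg: str, encoding: str) -> str:
    """Undo Python's implicit decoding of a command-line argument, then
    decode the original bytes with an encoding the application knows.

    `redecode_arg` is a hypothetical helper name used for illustration.
    """
    raw = os.fsencode(arg)       # reverse the implicit conversion: str -> original bytes
    return raw.decode(encoding)  # redo the conversion with the known encoding


# Simulate what sys.argv would hold for the Latin-1 bytes of "éléphant"
# (assumption: the filesystem encoding cannot decode them, or is Latin-1 itself).
simulated_arg = os.fsdecode(b"\xe9l\xe9phant")
fixed = redecode_arg(simulated_arg, "latin-1")
```

In a real program the call would look like `redecode_arg(sys.argv[1], "latin-1")`, with the encoding supplied by whatever the application knows about its input.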
Changes by Arfrever Frehtes Taifersar Arahesis <Arfrever.FTA@GMail.Com>:

----------
nosy: +Arfrever
Sreepriya Chalakkal added the comment:

I tried running the following code with Python 3.4:

    import sys
    print(sys.argv[1])
    print(b'bytes')

I ran it as follows, passing an argument in a different encoding:

    $ python ~/a.py `echo priya|iconv -t latin1`
    priya
    bytes

There was no unicode encode error generated! Is it because the problem is fixed?

----------
nosy: +sreepriya
Antoine Pitrou added the comment:

> There was no unicode encode error generated! Is it because the problem is fixed?

No, it's not fixed. First, it seems you are testing with Python 2 (otherwise you would get "b'bytes'", not "bytes"). Python 2 won't have a problem here, since it treats everything as bytestrings.

Second, to demonstrate the issue you must pass a non-ASCII string. For example:

    $ ./python a.py `echo éléphant|iconv -t latin1`
    Traceback (most recent call last):
      File "a.py", line 4, in <module>
        print(sys.argv[1])
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 0: surrogates not allowed

----------
nosy: +pitrou
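The failure in the traceback can be approximated without a shell. This is a sketch that assumes UTF-8 is the filesystem encoding, and it uses Python's own codecs in place of the interpreter's startup decoding, so it illustrates the mechanism and does not reproduce the exact code path.

```python
# Latin-1 bytes for "éléphant", as `iconv -t latin1` would produce them.
raw = b"\xe9l\xe9phant"

# Decoding as UTF-8 with surrogateescape maps each undecodable byte 0xNN
# to the lone surrogate U+DCNN instead of raising an error.
arg = raw.decode("utf-8", errors="surrogateescape")
first_char = arg[0]  # the lone surrogate that stands in for byte 0xE9

# A strict UTF-8 encode, like the one in the traceback, rejects lone surrogates.
try:
    arg.encode("utf-8")
except UnicodeEncodeError as exc:
    error_reason = exc.reason
else:
    error_reason = None

# Encoding with the same error handler restores the original bytes.
restored = arg.encode("utf-8", errors="surrogateescape")
```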
Sreepriya Chalakkal added the comment:

You are right. Instead of running ./python inside the python directory, I ran the default, older python. Based on the Stack Overflow link given, I tried to write some documentation. I am attaching the patch!

----------
keywords: +patch
Added file: http://bugs.python.org/file34470/Issue17110.patch
Changes by andy.ma <andy.junma@gmail.com>:

----------
nosy: +andyma
Antoine Pitrou added the comment:

Hmm, I'm not sure where those explanations belong, but I'm not sure they should be in the sys module docs (especially as they are quite lengthy, and they also apply to other data such as os.environ). Perhaps the Unicode HOWTO?
Change by Inada Naoki <songofacandy@gmail.com>:

----------
pull_requests: +12542
stage: needs patch -> patch review
Inada Naoki <songofacandy@gmail.com> added the comment:

New changeset 38f4e468d4b55551e135c67337c18ae142193ba8 by Inada Naoki in branch 'master':
bpo-17110: doc: add note how to get bytes from sys.argv (GH-12602)
https://github.com/python/cpython/commit/38f4e468d4b55551e135c67337c18ae1421...

----------
nosy: +inada.naoki
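For reference, the idiom that the sys.argv note describes, as far as I understand it, is to pass each argument back through os.fsencode(). A short sketch follows; the variable name `argv_bytes` is an illustrative choice.

```python
import os
import sys

# Re-encode every argument with the filesystem encoding and its error
# handler, recovering the bytes that the OS originally passed on Unix.
argv_bytes = [os.fsencode(arg) for arg in sys.argv]
```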
Change by miss-islington <mariatta.wijaya+miss-islington@gmail.com>:

----------
pull_requests: +12559
miss-islington <mariatta.wijaya+miss-islington@gmail.com> added the comment:

New changeset 5b80cb5584a72044424f2d82d0ae79c720f24c47 by Miss Islington (bot) in branch '3.7':
bpo-17110: doc: add note how to get bytes from sys.argv (GH-12602)
https://github.com/python/cpython/commit/5b80cb5584a72044424f2d82d0ae79c720f...

----------
nosy: +miss-islington
Change by Inada Naoki <songofacandy@gmail.com>:

----------
resolution: -> fixed
stage: patch review -> resolved
status: open -> closed
versions: +Python 3.7, Python 3.8 -Python 3.2, Python 3.3, Python 3.4
Manuel Jacob <me@manueljacob.de> added the comment:

The actual startup code uses Py_DecodeLocale() for converting argv from bytes to unicode. Since which Python version is it guaranteed that Py_DecodeLocale() and os.fsencode() roundtrip?

----------
nosy: +mjacob
Inada Naoki <songofacandy@gmail.com> added the comment:

There is no strict guarantee. I think ASCII, UTF-8, and latin1 with surrogateescape guarantee a roundtrip. Other legacy encodings like cp932 may not roundtrip. But that is not a huge problem, because typically only Windows uses them.

On Windows:

* wchar_t is used in most cases, instead of fsencoding
* fsencoding is now UTF-8 by default

In other words, if you are using a legacy encoding on Unix, it may not roundtrip.
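The roundtrip claim for ASCII, UTF-8, and latin1 can be spot-checked with a brute-force sketch like the one below. It relies on Python's codecs with the surrogateescape handler, not on the C locale machinery behind Py_DecodeLocale(), so it only approximates what happens at startup. The helper name `non_roundtripping` and the two-byte search space are assumptions made for illustration.

```python
from itertools import product


def non_roundtripping(encoding: str, length: int = 2) -> list:
    """Return every byte string of the given length that does not survive
    decode -> encode with the surrogateescape error handler.

    `non_roundtripping` is a hypothetical helper name used for illustration.
    """
    failures = []
    for combo in product(range(256), repeat=length):
        raw = bytes(combo)
        text = raw.decode(encoding, errors="surrogateescape")
        if text.encode(encoding, errors="surrogateescape") != raw:
            failures.append(raw)
    return failures


# Count the failing two-byte sequences for each of the three encodings.
failure_counts = {
    name: len(non_roundtripping(name)) for name in ("ascii", "utf-8", "latin-1")
}
```

The same helper could be pointed at a legacy codec such as cp932 to look for sequences that fail, though a result from the Python codec would not necessarily match the behaviour of the platform's locale functions.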
Manuel Jacob <me@manueljacob.de> added the comment:

If the encoding supports it, since which Python version do Py_DecodeLocale() and os.fsencode() roundtrip?

The background of my question is that Mercurial goes some extra rounds to determine the correct encoding to emulate what Py_EncodeLocale() would do: https://www.mercurial-scm.org/repo/hg/file/5.4.1/mercurial/pycompat.py#l157 . If os.fsencode() could be used, it would simplify the code. Mercurial supports Python 3.5+.
Inada Naoki <songofacandy@gmail.com> added the comment:

Manuel Jacob <me@manueljacob.de> added the comment:
> If the encoding supports it, since which Python version do Py_DecodeLocale() and os.fsencode() roundtrip?

Maybe since Python 3.2. FWIW, fsencode was added by Victor in https://bugs.python.org/issue8514

> The background of my question is that Mercurial goes some extra rounds to determine the correct encoding to emulate what Py_EncodeLocale() would do: https://www.mercurial-scm.org/repo/hg/file/5.4.1/mercurial/pycompat.py#l157 . If os.fsencode() could be used, it would simplify the code. Mercurial supports Python 3.5+.

I think it is the right approach. One of the important use cases of os.fsencode is using a file path from sys.argv even if it cannot be decoded with the filesystem encoding.
participants (8)
- andy.ma
- Antoine Pitrou
- Arfrever Frehtes Taifersar Arahesis
- Inada Naoki
- Manuel Jacob
- miss-islington
- Nick Coghlan
- Sreepriya Chalakkal