[issue19519] Parser: don't transcode input string to UTF-8 if it is already encoded to UTF-8

Thu Nov 7 13:40:55 CET 2013

New submission from STINNER Victor:

The Python parser (Parser/tokenizer.c) has a translate_into_utf8() function that decodes a string from the input encoding and re-encodes it as UTF-8.

This function is unnecessary when the input string is already encoded in UTF-8, which is common nowadays: Linux, Mac OS X and many other operating systems use UTF-8 as the default locale encoding, and UTF-8 is the default encoding for Python scripts. The compile(), eval() and exec() functions also pass UTF-8 encoded strings to the parser.

The attached patch adds an input_is_utf8 flag to the tokenizer so that translate_into_utf8() is skipped when the input string is already encoded in UTF-8.
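
To illustrate the idea outside of C, here is a minimal Python sketch of the same shortcut. It is not the patch itself: the helper names is_utf8(), transcode_to_utf8() and prepare_source() are hypothetical, and transcode_to_utf8() is only a stand-in for what translate_into_utf8() does in the tokenizer.

```python
import codecs


def is_utf8(encoding):
    """Return True if `encoding` is UTF-8 or one of its aliases ("utf8", "UTF-8", ...)."""
    # codecs.lookup() normalizes aliases; the canonical name is "utf-8".
    return codecs.lookup(encoding).name == "utf-8"


def transcode_to_utf8(data, encoding):
    """Stand-in for translate_into_utf8(): decode from `encoding`, re-encode as UTF-8."""
    return data.decode(encoding).encode("utf-8")


def prepare_source(data, encoding):
    """Return `data` as UTF-8 bytes, skipping the transcoding step when possible."""
    if is_utf8(encoding):
        # Fast path (the role played by the input_is_utf8 flag):
        # the bytes are handed over unchanged, with no decode/encode round trip.
        return data
    return transcode_to_utf8(data, encoding)
```

For example, prepare_source(source_bytes, "utf8") returns the very same bytes object, while prepare_source(source_bytes, "latin-1") still goes through the decode/encode round trip. Note that the fast path in this sketch performs no validation of the input bytes.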

----------
files: input_is_utf8.patch
keywords: patch
messages: 202331
nosy: benjamin.peterson, haypo, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Parser: don't transcode input string to UTF-8 if it is already encoded to UTF-8
type: performance
versions: Python 3.4
Added file: http://bugs.python.org/file32526/input_is_utf8.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue19519>
_______________________________________