[Patches] [ python-Patches-1313939 ] Speedup PyUnicode_DecodeCharmap

Thu Oct 6 20:37:23 CEST 2005

Patches item #1313939, was opened at 2005-10-05 17:01
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1313939&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Library (Lib)
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: Nobody/Anonymous (nobody)
Summary: Speedup PyUnicode_DecodeCharmap

Initial Comment:
This patch speeds up PyUnicode_DecodeCharmap() as
discussed in the thread:
http://mail.python.org/pipermail/python-dev/2005-October/056958.html

It makes it possible to pass a unicode string to
PyUnicode_DecodeCharmap() in addition to the
dictionary, which is still supported. The unicode
character at position i in the string is used as the
decoded value for byte i. Byte values greater than or
equal to the length of the string and u"\ufffd"
characters in the string are treated as "maps to
undefined".
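
The lookup described above can be sketched in present-day Python
(the patch itself is C code in unicodeobject.c). This is only an
illustration: the name charmap_decode_sketch and the undefined
parameter are assumptions made for the example, not part of the patch.

```python
def charmap_decode_sketch(data, table, undefined="\ufffd"):
    """Decode bytes through a string table: table[i] is the value for byte i.

    Hypothetical helper. Bytes beyond the end of the table, and table
    entries equal to `undefined`, are treated as "maps to undefined".
    """
    out = []
    for pos, byte in enumerate(data):
        # A byte value past the end of the table has no mapping at all.
        ch = table[byte] if byte < len(table) else undefined
        if ch == undefined:
            raise UnicodeDecodeError(
                "charmap-sketch", bytes(data), pos, pos + 1,
                "character maps to <undefined>",
            )
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    table = "abc"  # byte 0 -> "a", byte 1 -> "b", byte 2 -> "c"
    print(charmap_decode_sketch(b"\x00\x02", table))
```

A string indexed by byte value replaces one dictionary lookup per
byte with a plain index operation, which is where the speedup comes from.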

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-06 20:37

Message:
Logged In: YES 
user_id=38388

Yes, please.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-10-06 18:59

Message:
Logged In: YES 
user_id=89016

> I can regenerate the codecs using gencodec.py, no problem. I
> can also change it to create the string mapping.

That would be great.

So should I check in everything else (i.e. unicodeobject.c
and the doc and test changes)?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-06 18:02

Message:
Logged In: YES 
user_id=38388

Thanks for the change.

I can regenerate the codecs using gencodec.py, no problem. I
can also change it to create the string mapping.

For reference: the mapping files can be downloaded from
ftp.unicode.org. The gencodec.py script then takes the
mapping filename as argument and creates a codec .py file
from it.

Special care has to be taken, since some codecs contain
hand-edited details.

Note that it's likely that some codecs will have additions
or removals - the files on ftp.unicode.org are updated
every now and then and usually contain more up-to-date mappings.

For some codecs, you won't find mapping files on the Unicode
site - these were then contributed by 3rd parties.
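
As a rough illustration of what "creating the string mapping" from a
mapping file involves, here is a sketch in present-day Python. It is
not the code of gencodec.py: build_decoding_table and UNDEFINED are
hypothetical names, and the sketch assumes the usual layout of the
Unicode mapping files (byte value and code point as 0x-prefixed hex
in the first two columns, "#" starting a comment).

```python
UNDEFINED = "\ufffe"  # marker for bytes that have no mapping


def build_decoding_table(lines, size=256):
    """Turn mapping-file lines into a decoding string of `size` characters."""
    table = [UNDEFINED] * size
    for line in lines:
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # byte listed without a target: leave it undefined
        byte = int(fields[0], 16)
        code_point = int(fields[1], 16)
        if byte < size:
            table[byte] = chr(code_point)
    return "".join(table)


if __name__ == "__main__":
    sample = [
        "# Sample lines in mapping-file style",
        "0x41\t0x0041\t# LATIN CAPITAL LETTER A",
        "0x80\t0x20AC\t# EURO SIGN",
        "0x81\t\t# UNDEFINED",
    ]
    table = build_decoding_table(sample)
    print(len(table), ascii(table[0x80]), ascii(table[0x81]))
```

Hand-edited details in a codec would have to be applied on top of a
table generated this way.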

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-10-06 17:50

Message:
Logged In: YES 
user_id=89016

> Thanks, Walter. However, you won't get my approval with the
> choice of FFFD as meaning "undefined mapping" - that code
> point is defined. Please choose a different code that is
> documented to never be defined by the Unicode standard.

OK, I've updated the patch to use 0xfffe instead. Note that
this only works as long as u"\ufffe" is a legal Unicode literal.

> Also, please explain the new alias 'unicode_1_1_utf_7'  :
> 'utf_7'.

Oops, that was for bug #1245379. Removed.

> About the make_maps() function: the decoding maps should be
> generated by the gencodec.py script instead of doing this at
> import time. The dictionaries can then be removed from the
> codecs.

That's true. I'm still working on that. Do you have any tips
on how to do that (which files do I have to download, where
do I have to put them, and how (and from where) do I have to
call gencodec.py)? Is this documented somewhere?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-06 17:24

Message:
Logged In: YES 
user_id=38388

Thanks, Walter. However, you won't get my approval with the
choice of FFFD as meaning "undefined mapping" - that code
point is defined. Please choose a different code that is
documented to never be defined by the Unicode standard.

Also, please explain the new alias 'unicode_1_1_utf_7'  :
'utf_7'.

About the make_maps() function: the decoding maps should be
generated by the gencodec.py script instead of doing this at
import time. The dictionaries can then be removed from the
codecs.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-10-06 16:45

Message:
Logged In: YES 
user_id=89016

OK, I've updated the patch to include an update of the
documentation and a few tests, and I've simplified
codecs.make_maps a bit, since we'll always have a decoding
string. 0xFFFD is still used as the undefined marker value.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-05 22:39

Message:
Logged In: YES 
user_id=38388

The whole point is finding a replacement code point for the
None value in dictionaries. Since None is not a character, a
code point should be chosen that is guaranteed to never be
assigned. FFFE is such a code point, hence the choice.  FFFD
is an assigned code point.

Note that a mapping to FFFE will always raise an exception
and the codec user can then decide to use the replace error
handler to have the codec use FFFD instead.

It is also very reasonable for a codec to map some
characters to FFFD to avoid invoking the exception handling
in those cases.
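
The distinction can be tried out with codecs.charmap_decode() in
present-day Python 3, which accepts a string as the mapping and
treats U+FFFE entries as undefined. This is a sketch of the behaviour
argued for here, not the 2005 patch; TABLE and decode() are names made
up for the example.

```python
import codecs

# Hypothetical three-entry decoding string:
#   byte 0 -> "a", byte 1 -> undefined (U+FFFE), byte 2 -> explicit U+FFFD
TABLE = "a\ufffe\ufffd"


def decode(data, errors="strict"):
    """Decode `data` through TABLE and return only the text."""
    text, _consumed = codecs.charmap_decode(data, errors, TABLE)
    return text


if __name__ == "__main__":
    # An explicit U+FFFD entry is an ordinary mapping; no error handling runs.
    print(ascii(decode(b"\x00\x02")))
    # An undefined byte goes through the error handler chosen by the caller.
    print(ascii(decode(b"\x00\x01", "replace")))
    try:
        decode(b"\x00\x01")
    except UnicodeDecodeError as exc:
        print("strict mode failed:", exc.reason)
```

With a separate marker, the codec author can still map a byte to
U+FFFD on purpose, while the caller decides what happens to bytes
that have no mapping.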

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2005-10-05 20:36

Message:
Logged In: YES 
user_id=21627

For decoding, Walter's code is nearly identical to the
fastmap decoder: both use a Py_UNICODE array to represent
the map, and both use REPLACEMENT CHARACTER to denote a
missing target code.

I find the use of U+FFFD highly appropriate, and not at all
debatable. None of the existing codecs maps any of its
characters to U+FFFD, and I would consider it a bug if one
did. REPLACEMENT CHARACTER should only be used if there is
no appropriate character, so no charmap should claim that the
appropriate mapping for some byte is that character.

That you often use U+FFFD in output to denote unmappable
characters is a different issue; indeed, Python's "replace"
mode does so. It would continue to do so under this patch.
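
Whether a given codec maps anything to a particular character can be
checked mechanically. The sketch below assumes present-day CPython,
where generated charmap codecs expose a module-level decoding_table
string; that layout postdates this discussion, and table_contains is
a hypothetical helper.

```python
import importlib


def table_contains(codec_module, char):
    """Return True if encodings.<codec_module>.decoding_table contains char.

    Codec modules without a decoding_table string (i.e. codecs that are
    not charmap-based) yield False.
    """
    module = importlib.import_module("encodings." + codec_module)
    table = getattr(module, "decoding_table", None)
    return table is not None and char in table


if __name__ == "__main__":
    for name in ("cp1252", "iso8859_1", "cp437"):
        print(name,
              "U+FFFD:", table_contains(name, "\ufffd"),
              "U+FFFE:", table_contains(name, "\ufffe"))
```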

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-05 19:50

Message:
Logged In: YES 
user_id=38388

The patch looks good, but I'd still like to see whether
Hye-Shik's fastmap codec wouldn't be a better and more
general solution since it seems to also provide good
performance for encoding Unicode strings.

That said, you should use a noncharacter such as 0xFFFE
to mean "undefined mapping". The Unicode replacement
character is not a good choice, as it is a perfectly valid
character that often actually gets used to replace
characters for which no Unicode code point is known.

----------------------------------------------------------------------
