[Patches] [Patch #103100] Multicharacter replacements in PyUnicode_TranslateCharmap

Fri, 05 Jan 2001 09:07:35 -0800

Patch #103100 has been updated. 

Project: python
Category: core (C code)
Status: Open
Submitted by: doerwalter
Assigned to : lemburg
Summary: Multicharacter replacements in PyUnicode_TranslateCharmap

Follow-Ups:

Date: 2001-Jan-05 09:07
By: doerwalter

Comment:
The problem is that you can't know beforehand how long
the result string will be, i.e. whether any 1-n
replacements will actually happen.

It would be possible to do a loop through the replacement
strings and see if there are any that are longer than one character,
but even if there are, you don't know if they will really be used.

So you have three choices:
(1) You guess how much space you need and reallocate
when the space is not enough,
(2) you do a dry run of the algorithm once to count how much
space you need, then run the algorithm a second time and
this time copy the strings, or
(3) you keep the strings in a list and join the list into
one string at the end.
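
The three choices can be sketched roughly as follows. This is only a
Python-level model of the buffer handling, not the C code of the patch:
the function names are made up for illustration, a Python list stands in
for the C output buffer, the table is assumed to be a plain dict from
ordinals to replacement strings, and the doubling growth policy in (1) is
an assumption.

```python
def translate_grow(text, table):
    """Choice (1): guess the size, reallocate when the guess is too small."""
    buf = [None] * len(text)          # guess: one output char per input char
    used = 0
    for ch in text:
        rep = table.get(ord(ch), ch)  # unmapped characters are copied as-is
        needed = used + len(rep)
        if needed > len(buf):         # the per-character size check
            # "realloc": grow to at least the needed size (doubling is assumed)
            buf.extend([None] * (max(needed, 2 * len(buf)) - len(buf)))
        buf[used:needed] = rep
        used = needed
    return "".join(buf[:used])


def translate_two_pass(text, table):
    """Choice (2): dry run to count the space, second run to copy."""
    size = sum(len(table.get(ord(ch), ch)) for ch in text)  # first pass
    buf = [None] * size               # single allocation of the exact size
    pos = 0
    for ch in text:                   # second pass looks everything up again
        rep = table.get(ord(ch), ch)
        buf[pos:pos + len(rep)] = rep
        pos += len(rep)
    return "".join(buf)


def translate_join(text, table):
    """Choice (3): collect the parts in a list and join them at the end."""
    parts = [table.get(ord(ch), ch) for ch in text]
    return "".join(parts)


table = {ord("a"): "xyz", ord("b"): "B"}   # one 1-n and one 1-1 replacement
for func in (translate_grow, translate_two_pass, translate_join):
    print(func.__name__, func("abca", table))
```

All three produce the same string; they differ only in when and how often
memory is allocated and how often the mapping is consulted.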

For the case of 1-1 mapping the following will happen:

(1) The first allocation has exactly the right amount of space, so
there won't be any reallocations, but a size check will be done for
every character (which should be only a few assembler instructions).
The mapping has to be accessed once for every character
in the source string.

(2) There will be only one allocation, but for every character in
the source string the mapping has to be accessed twice, and each
access involves calls to Python functions, exception handling etc.

(3) You have to make as many memory allocations as there are parts
of the final string that you create, including error handling etc.

I think (1) is clearly the fastest method.
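
To make the comparison of (1) and (2) for a pure 1-1 mapping concrete,
here is a small counting sketch. It is an illustration only: CountingMap
and both functions are hypothetical names, a Python list again stands in
for the C buffer, and the counts say nothing about real timings.

```python
class CountingMap:
    """Wraps a dict and counts how often the mapping is accessed."""

    def __init__(self, mapping):
        self.mapping = mapping
        self.lookups = 0

    def lookup(self, ch):
        self.lookups += 1
        return self.mapping.get(ord(ch), ch)


def grow_as_needed(text, cmap):
    """Method (1): one pass, size check per character, grow on demand."""
    buf = [None] * len(text)
    used = 0
    reallocs = 0
    for ch in text:
        rep = cmap.lookup(ch)
        needed = used + len(rep)
        if needed > len(buf):         # never true for a pure 1-1 mapping
            buf.extend([None] * (max(needed, 2 * len(buf)) - len(buf)))
            reallocs += 1
        buf[used:needed] = rep
        used = needed
    return "".join(buf[:used]), reallocs


def dry_run_then_copy(text, cmap):
    """Method (2): count first, then allocate once and copy."""
    size = sum(len(cmap.lookup(ch)) for ch in text)
    buf = [None] * size
    pos = 0
    for ch in text:
        rep = cmap.lookup(ch)         # second access for the same character
        buf[pos:pos + len(rep)] = rep
        pos += len(rep)
    return "".join(buf)


one_to_one = {ord("l"): "L"}          # every replacement has length 1

cmap = CountingMap(one_to_one)
result, reallocs = grow_as_needed("hello", cmap)
print(result, "lookups:", cmap.lookups, "reallocations:", reallocs)

cmap = CountingMap(one_to_one)
print(dry_run_then_copy("hello", cmap), "lookups:", cmap.lookups)
```

For the five-character input, method (1) accesses the mapping five times
and never reallocates, while method (2) accesses it ten times.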

-------------------------------------------------------

Date: 2001-Jan-04 10:33
By: nobody

Comment:
I like the idea, but the implementation needs some reworking:
the common case is 1-1 mapping so this should be as fast
as possible; extra size checks slow things down too much.

You can take a different approach, though:
leave things as they are and only add a special case for 1-n
replacements, which resizes depending on how many extra chars are
inserted. Then as a final step, if resizing occurred, call
_PyUnicode_Resize() to cut down the allocated buffer to its true size.
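
One way to read this suggestion, sketched in Python rather than C. The
function name, the slack parameter and the dict-based table are
assumptions made for illustration; the final `del` merely plays the role
that _PyUnicode_Resize() would play in the C implementation.

```python
def translate_special_case(text, table, slack=8):
    """Fast path for 1-1 replacements, resizing only in the 1-n case."""
    buf = [None] * len(text)          # exact size if every mapping is 1-1
    pos = 0
    resized = False
    for ch in text:
        rep = table.get(ord(ch), ch)
        if len(rep) == 1:
            # Common case: no size check is needed here, because the buffer
            # always has at least one free slot per remaining input character.
            buf[pos] = rep
            pos += 1
        else:
            extra = len(rep) - 1      # chars beyond the one already planned
            if extra > 0:
                # Over-allocate a little (slack is an assumed policy).
                buf.extend([None] * (extra + slack))
                resized = True
            buf[pos:pos + len(rep)] = rep
            pos += len(rep)
    if resized or pos < len(buf):
        del buf[pos:]                 # final step: cut down to the true size
    return "".join(buf)


print(translate_special_case("abca", {ord("a"): "xyz"}))
print(translate_special_case("abc", {ord("a"): "A"}))
```

The fast path stays free of size checks because each 1-n replacement
grows the buffer by at least the number of extra characters it inserts.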

-- Marc-Andre
-------------------------------------------------------

-------------------------------------------------------
For more info, visit:

http://sourceforge.net/patch/?func=detailpatch&patch_id=103100&group_id=5470