Find and Replace Simplification
davea at davea.name
Sat Jul 20 20:04:51 CEST 2013
On 07/20/2013 01:03 PM, Joshua Landau wrote:
> On 20 July 2013 12:57, Serhiy Storchaka <storchaka at gmail.com> wrote:
>> 20.07.13 14:16, Joshua Landau написав(ла):
>>> However, some quick timing shows that translate has a very
>>> high penalty for missing characters and is a tad slower any way.
>>> Really, though, there should be no reason for .translate() to be
>>> slower than replace -- at worst it should just be "reduce(lambda s,
>>> ab: s.replace(*ab), mapping.items()¹, original_str)" and end up the
>>> *same* speed as iterated replace.
>> It doesn't work such way. Consider
> Still, it seems to me that it should be optimizable for sensible
> builtin types such that .translate is significantly faster, as there's
> no theoretical extra work that .translate *has* to do that .replace
> does not, and .replace also has to rebuild the string a lot of times.
translate is going to be faster (than replace) for Unicode if it has a
"large" table. For example, to translate from ASCII to EBCDIC, where
every character in the string is replaced by a new one. I have no idea
what the cutoff is. But of course, for a case like ASCII to EBCDIC, it
would be very tricky to do it with replaces, probably taking much more
than the expected 96 passes.
translate for byte strings is undoubtedly tons faster. For byte
strings, the translation table is 256 bytes, and the inner loop is a
simple lookup. But for Unicode, the table is a dict (or something very
like it, I looked at the C code, not the Python code).
So for every character in the input string, it does a dict-type lookup,
before it can even decide if the character is going to change.
Just for reference, the two files I was looking at were:
Extracted from the bz2 downloaded from the page:
More information about the Python-list