String multi-replace

Dave Angel davea at ieee.org
Thu Nov 18 09:19:22 EST 2010


On 2:59 PM, Sorin Schwimmer wrote:
> Steven D'Aprano: the original file is 139MB (that's the typical size for it). Eliminating diacritics is just a little topping on the cake; the processing is something else.
>
> Thanks anyway for your suggestion,
> SxN
>
> PS Perhaps I should have mentioned that I'm on Python 2.7
>
>
In the message you were replying to, Steven made a much more important 
suggestion than the one about size, and you apparently didn't notice 
it.  Chris implied much the same.  I'll try a third time.

The file is obviously encoded, and you know the encoding.  Judging from 
the first entry in your table, it's utf-8.  If so, then your approach is 
all wrong.  Treating the file as a pile of bytes and replacing byte 
pairs is likely to get you into trouble, since a pair could match the 
last byte of one character together with the first byte of the next.  
If you substitute such a match, you'll make a hash of the whole region, 
and quite likely end up with a byte stream that is no longer even valid 
utf-8.

Fortunately, you can solve that problem, and simplify your code greatly 
in the bargain, by doing something like what was suggested by Steven.

Change your map of encoded bytes into unicode_nodia: call 
decode("utf-8") on each key and take ord() of the one-character result, 
since unicode.translate() is keyed by Unicode ordinals, and decode each 
value to a unicode (u"") string.
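
For instance, a minimal sketch of that conversion (Python 2.7; the 
nodia name and its sample entries are made up for illustration):

    # hypothetical original table: utf-8 byte pairs -> ASCII replacements
    nodia = {'\xc4\x82': 'A', '\xc4\x83': 'a'}

    # unicode.translate() wants {ordinal: unicode}, so decode each key
    # and take ord() of the one character it represents
    unicode_nodia = dict((ord(k.decode("utf-8")), v.decode("utf-8"))
                         for k, v in nodia.items())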

Read in each line of the file, decode it to the unicode it represents, 
and do a simple translate once it's valid unicode.

Assuming the line is in utf-8, use
   uni = line.decode("utf-8")
   newuni = uni.translate(unicode_nodia)
   newutf8 = newuni.encode("utf-8")
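
Put together, one pass over the file might look like this (a sketch 
only; the file names are hypothetical):

    # decode each line, translate away the diacritics, re-encode
    with open("input.txt") as src, open("output.txt", "w") as dst:
        for line in src:
            uni = line.decode("utf-8")
            dst.write(uni.translate(unicode_nodia).encode("utf-8"))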

Incidentally, to see what a given byte pair in your table is, you can do 
something like:

    >>> import unicodedata
    >>> a = chr(196) + chr(130)
    >>> unicodedata.name(a.decode("utf-8"))
    'LATIN CAPITAL LETTER A WITH BREVE'
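
As an aside, if the real goal is to drop all diacritics rather than 
apply a hand-built table, unicodedata can also do that via NFKD 
decomposition.  A sketch, assuming every accented letter you care about 
decomposes into a base letter plus combining marks:

    import unicodedata

    def strip_diacritics(uni):
        # split each accented letter into base letter + combining marks,
        # then keep only the non-combining characters
        decomposed = unicodedata.normalize("NFKD", uni)
        return u"".join(c for c in decomposed
                        if not unicodedata.combining(c))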



DaveA




