Translation table to map Latin-1 to ASCII?

Rene Pijlman reageer.in at de.nieuwsgroep
Sun Jan 26 07:54:53 EST 2003


John Machin:
>Rene Pijlman:
>> Can anyone point me to a translation table for string.translate
>> to map Latin-1 (ISO 8859-1) to ASCII such that \"e maps to e
>> etc.?
>
>The translation table would depend on what you want to use it for; you
>are smashing 256 different characters into 128, so you have a choice
>of losing information or emitting at least two characters per input
>character. 

I want to use it for log file analysis and statistical reporting
of an ht://Dig search engine. I've configured the search engine
itself to use the 'accents' algorithm: "This algorithm will
treat all accented letters as equivalent to their unaccented
counterparts." http://www.htdig.org/attrs.html

Now I want to mimic this behavior in the log analyzer that I'm
writing in Python. 

For example, if someone from Germany searches for \"ubersetzung
and someone from Holland searches for ubersetzung I want those
to count as two searches for the same search phrase in the
search statistics.
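
A minimal sketch of that counting, in Python 3, might look like the following. It is an assumption about how the log analyzer could work, not ht://Dig's own algorithm: the helper name `fold_accents` and the sample phrases are hypothetical, and it uses Unicode normalization (NFD) instead of a translation table.

```python
import unicodedata
from collections import Counter

def fold_accents(phrase):
    # Hypothetical helper: decompose each character (NFD), then drop
    # the combining marks so only the base letters remain.
    decomposed = unicodedata.normalize("NFD", phrase)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Made-up search phrases, standing in for what a log file might contain.
searches = ["\u00fcbersetzung", "ubersetzung", "caf\u00e9", "cafe", "python"]

# Fold case and accents before counting, so variants share one key.
counts = Counter(fold_accents(s.lower()) for s in searches)
print(counts["ubersetzung"])
```

Both spellings end up under the same key, so the German and the Dutch search are counted together.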
 
>What do you want to do with all the non-alphabetic characters?

Perhaps I should rephrase the challenge: I want to map Latin-1
to Latin-1, 1-to-1 for most characters, but with accented
letters mapped to their unaccented counterparts.
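
One way such a 1-to-1 table could be built in Python 3 is sketched below. This is only an illustration under stated assumptions: the names `make_table` and `NOACCENTS` are made up here, the input is assumed to be Latin-1 encoded bytes, and letters without a Unicode decomposition (such as the German sharp s) are simply left unchanged.

```python
import unicodedata

def make_table():
    # One output byte per input byte, so the mapping stays Latin-1 to Latin-1.
    table = bytearray(range(256))
    for i in range(256):
        decomp = unicodedata.decomposition(chr(i))
        # Canonical decompositions start with a hex code point;
        # compatibility ones start with a "<tag>" and are skipped.
        if decomp and not decomp.startswith("<"):
            base = int(decomp.split()[0], 16)
            if base < 256:
                table[i] = base
    return bytes(table)

NOACCENTS = make_table()

raw = "\u00fcbersetzung na\u00efve stra\u00dfe".encode("latin-1")
print(raw.translate(NOACCENTS).decode("latin-1"))
```

Every byte still maps to exactly one byte, so no information about length or position is lost; only the accents are dropped.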

>I trust this isn't a Python specific question i.e. if someone gave you
>a translation table to be used in C or some other language, you'd be
>able to Pythonise it. 

Sure, and I can also write it myself or get it from ht://Dig's
source code. But I'm lazy and I was just hoping that someone
would have a similar translation table for Python's
string.translate lying around :-)

I found the following solution in this group's archive on
Google. If only I understood how it works :-)

import unicodedata, string

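def maketable():
    # build iso-latin-1 to "undotted" ascii translation table
    # (Python 2: range() returns a list; unichr() and string.join exist)
    table = range(256)
    for i in range(256):
        # canonical decompositions look like "0065 0308" (base, then accent);
        # compatibility ones start with "<tag>" and are left alone
        x = unicodedata.decomposition(unichr(i))
        if x and x[0] == "0":
            table[i] = int(x.split()[0], 16)
    return string.join(map(chr, table), "")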

text = "text with accented characters"

noaccents = maketable()

undotted_text = string.translate(text, noaccents)

print undotted_text
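
A quick way to see what the function above relies on is to print what `unicodedata.decomposition` returns for a few characters. The snippet below is a small Python 3 sketch (so `chr` and `print()` replace `unichr` and the `print` statement); the choice of sample characters is arbitrary.

```python
import unicodedata

# e-diaeresis, no-break space, sharp s, plain e
samples = ["\u00eb", "\u00a0", "\u00df", "e"]

for ch in samples:
    # An accented letter gives "base accent" as hex code points,
    # a compatibility mapping starts with "<tag>", and a character
    # with no decomposition gives an empty string.
    print(repr(ch), repr(unicodedata.decomposition(ch)))

# The first hex field of a canonical decomposition is the base letter.
base = int(unicodedata.decomposition("\u00eb").split()[0], 16)
print(chr(base))
```

The test `x[0] == "0"` therefore selects only the accented letters, and `int(x.split()[0], 16)` picks out the unaccented base letter. Characters such as the sharp s have no decomposition and pass through untouched.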

-- 
René Pijlman

What do you want to learn?  http://www.leren.nl



