[Tutor] Absolute newbie - Transliteration

Magnus Lyckå magnus@thinkware.se
Thu May 22 22:22:01 2003


At 23:51 2003-05-20 -0700, David Rogers wrote:
>I'm an absolute newbie - this is my first attempt with Python or any 
>"real" language, so my advance apologies for any stupid comments.

You are very welcome. I hope you will enjoy Python as much
as we do.

>   I joined the list just to ask this question, after doing a little 
> searching in the list archives and the documentation and not being able 
> to find out what I want to know.

It's always nice to see people who do their homework. :)

>I'm trying make scripts to transliterate a file from (Unicode) Cyrillic 
>characters to each of
>- Roman script, and
>- International Phonetic Alphabet (more Unicode).
>
>(Whether I end up with separate scripts, one for each transliteration, or 
>one script for all with a bigger dictionary/list/table, is not important 
>to me.)

>The transliteration will not always be one-to-one in terms of the number 
>of characters, for example the "ch" sound is one letter in Russian but 
>corresponds to two letters in English.

I have a feeling it might not be completely trivial to do this at all.
But that depends... If it's always one Russian letter being translated
into one or more Roman / IPA symbols, it's no problem. I just consulted
my wife, who knows Russian, and it seems you should be able to do this.
The only two-letter combination she could think of is a "soft"
indicator that is only used to modify the preceding consonant. I'm
sure you know this.

If you had gone the other direction, it would be much harder. For instance,
it's not trivial to determine if the letter combination "sh" is the sh-sound.
Compare "dishes" with "dishonor". Transliteration of English to Cyrillic or
phonetic letters seems to require a lot of contextual aid...
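
To make the "sh" problem concrete, here is a minimal sketch (written for
Python 3; the function name naive_sh is made up for illustration) of a
context-free replacement rule. It handles "dishes" but also rewrites the
"sh" in "dishonor", where the s and the h belong to different syllables:

```python
def naive_sh(word):
    # Blindly treat every "sh" as the sh-sound (IPA symbol U+0283).
    return word.replace("sh", "\u0283")

for word in ("dishes", "dishonor"):
    print(word, "->", naive_sh(word))
```

A rule that gets both words right needs to know something about the
words themselves, which is the contextual aid mentioned above.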

If we ignore this little softener for a while, you really just need a
dictionary. Use the Cyrillic letters as keys, and the other script as values.
A tiny excerpt matching my knowledge of the Cyrillic alphabet would be

cyr2rom = {'C': 'S', 'P': 'R'}

Then a simple (but far from optimal) algorithm would be:

rom_text = ""
cyr_text = "..."

for letter in cyr_text:
     rom_text = rom_text + cyr2rom[letter]


Of course, values in the dictionary might well be more than one character.
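
As a rough sketch of that (assuming Python 3, where every string is
Unicode), here is the same loop with a few real Cyrillic letters and
multi-character values. The romanizations are just one informal choice,
not an official standard, and the helper name transliterate and the
pass-through of unknown characters via dict.get are assumptions added
for the example:

```python
# A few real Cyrillic letters; values may be longer than one character.
cyr2rom = {
    "ч": "ch",
    "а": "a",
    "й": "y",
    "щ": "shch",
}

def transliterate(cyr_text, table):
    # Characters missing from the table (spaces, digits, ...) are
    # passed through unchanged instead of raising KeyError.
    return "".join(table.get(letter, letter) for letter in cyr_text)

print(transliterate("чай", cyr2rom))
```

With plain cyr2rom[letter], as in the loop above, an unknown character
raises KeyError instead, which can be useful while the table is still
incomplete.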

If you want to do both roman and IPA at once, you could do something
like:

cyr2rom_ipa = {'C': ('S', u'...'), 'P': ('R', u'...')}

rom_text = ""
ipa_text = ""
cyr_text = "..."

for letter in cyr_text:
     rom_text = rom_text + cyr2rom_ipa[letter][0]
     ipa_text = ipa_text + cyr2rom_ipa[letter][1]

Your main problem is that little softening symbol (that looks a bit
like 'b'). Somehow, you need to look ahead to see if it comes
after the current consonant, or perhaps it's easier to handle it
when it comes, and make a correction after the fact.

While we're at it, doing "s = s + c" with strings in a big loop is
very inefficient, since you create new string objects all the time.
It's much better to use a list, append to that, and turn it into
a string when all the processing is done. This also helps with the
soft-sign fix. Then it will look something like this:

cyr2rom_ipa = {'C': ('S', u'...'), 'P': ('R', u'...')}
soft_cyr2rom_ipa = {'N': ('NJ', u'...'), ...}
soft_symbol = u'...'

rom_text = []
ipa_text = []
cyr_text = "..."

for letter in cyr_text:
     if letter == soft_symbol:
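         # Replace the last letter with a soft version. For
         # roman I guess it's something like "N => NJ".
         # Save the hard roman letter first: it is the lookup key
         # for both tables, and the next line overwrites it.
         hard = rom_text[-1]
         rom_text[-1] = soft_cyr2rom_ipa[hard][0]
         ipa_text[-1] = soft_cyr2rom_ipa[hard][1]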
     else:
         rom_text.append(cyr2rom_ipa[letter][0])
         ipa_text.append(cyr2rom_ipa[letter][1])

rom_text = "".join(rom_text)
ipa_text = "".join(ipa_text)

Perhaps it's better to use the previous Cyrillic letter rather
than the (hard) one that you translated to for the softening.
That means that the keys to soft_cyr2rom_ipa will be different,
and you need to keep the previous Cyrillic letter in a variable.

cyr2rom_ipa = {'C': ('S', u'...'), 'P': ('R', u'...')}
soft_cyr2rom_ipa = {u'...': ('NJ', u'...'), ...}
soft_symbol = u'...'

rom_text = []
ipa_text = []
cyr_text = "..."

for letter in cyr_text:
     if letter == soft_symbol:
         #Replace the last letter with a soft version. For
         #roman I guess it's something like "N => NJ".
         #This should always come after a consonant. (Always last in word?)
         rom_text[-1] = soft_cyr2rom_ipa[previous_letter][0]
         ipa_text[-1] = soft_cyr2rom_ipa[previous_letter][1]
     else:
         rom_text.append(cyr2rom_ipa[letter][0])
         ipa_text.append(cyr2rom_ipa[letter][1])
     previous_letter = letter

rom_text = "".join(rom_text)
ipa_text = "".join(ipa_text)
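
For something you can actually run, here is a small roman-only sketch
of the previous-letter approach, assuming Python 3. The tables are tiny
and illustrative: "nj" just follows the "N => NJ" guess above and is
not a standard romanization, and the names SOFT_SIGN, soft_cyr2rom and
transliterate_soft are made up for the example.

```python
SOFT_SIGN = "ь"

# Tiny illustrative tables, keyed by Cyrillic letter.
cyr2rom = {"к": "k", "о": "o", "н": "n", "д": "d", "е": "e"}
soft_cyr2rom = {"н": "nj"}

def transliterate_soft(cyr_text):
    rom_text = []
    previous_letter = None
    for letter in cyr_text:
        if letter == SOFT_SIGN and previous_letter in soft_cyr2rom:
            # Replace the hard consonant just emitted with its soft version.
            rom_text[-1] = soft_cyr2rom[previous_letter]
        elif letter != SOFT_SIGN:
            # Ordinary letter; unknown characters pass through unchanged.
            rom_text.append(cyr2rom.get(letter, letter))
        # A soft sign with no soft entry for its consonant is dropped.
        previous_letter = letter
    return "".join(rom_text)

print(transliterate_soft("конь"))
print(transliterate_soft("день"))
```

Initializing previous_letter to None before the loop also covers the
odd case of a text that starts with the soft sign.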

>I have found the following in the Python web documentation...
>
>>translate(table[, deletechars])

Never mind that, as Bob said.

>>Return a copy of the string where all characters occurring in the 
>>optional argument deletechars are removed, and the remaining characters 
>>have been mapped through the given translation table, which must be a 
>>string of length 256.

This isn't the best piece of documentation in Pythondom.

It's a hint that the table must be a 256 character string though.
Each symbol in a normal 8 bit string has a value between 0 and 255.
E.g.

 >>> ord('A')
65
 >>> chr(65)
'A'

What happens is that the numeric value of each character in
your text string is used to find a character in the "table".
That's the replacement value.

So, translate basically does this:

def translate(s, table, deletechars=""):
     result = ""
     for char in s:
         if char not in deletechars:
             result = result + table[ord(char)]
     return result

If table = "A" * 256, then every character will be translated to "A". If
table = "".join([chr((x+5)%256) for x in range(256)])
then A => F, B => G etc.
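
If you want to try the two tables out, here is a minimal sketch,
assuming Python 3, where the 256-entry table belongs to bytes.translate
rather than to ordinary strings. The variable names are made up for the
example:

```python
# Shift every byte value by 5, wrapping around at 256.
shift_table = bytes((x + 5) % 256 for x in range(256))
shifted = b"Ab".translate(shift_table)

# Map every byte to "A", after first deleting the bytes for "l".
all_a = b"hello".translate(b"A" * 256, b"l")

print(shifted, all_a)
```

The second call shows the deletechars argument at work: the two "l"
bytes are removed before the remaining three are looked up in the table.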

>...but I don't understand what format my table needs to be in, or even if 
>this accommodates Unicode, or the problem of one character sometimes 
>translating to two.  If I'm completely on the wrong track here, somebody 
>laugh now before it's too late.   :-)

I hope I led you onto the track again.
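
On the Unicode and one-to-two questions: the 256-character table above
can do neither. As a hedged aside, assuming Python 3 (where all strings
are Unicode), str.translate accepts a mapping instead, and the values
may be longer than one character. The letters and romanizations here
are only an illustration:

```python
# str.maketrans turns single-character keys into the code point
# mapping that str.translate expects; values may be whole strings.
table = str.maketrans({"ч": "ch", "а": "a", "й": "y"})

print("чай".translate(table))
```

Characters that are not in the mapping are left as they are. The soft
sign would still need the separate handling described earlier, since
translate looks at one character at a time.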

>What I don't want is a pointer to a non-modifiable Cyrillic-to-Roman 
>transliteration application, because I want to re-use what I do here when 
>I make other transliteration tables to speed up IPA transcription from 
>other languages too.  I love IPA.    :-)

I think most other languages are much, much harder than Russian. :(

English is hopeless. Laugh, Garage, Women... Swedish is fairly hopeless
as well.

I think you realize by now (if not before) that the amount of shared
code for a thing like this is fairly small. Transliterating from Russian
seems truly trivial compared to transliteration from most western European
languages. For English, you would need to build in a major understanding
of the language. I don't know if the information you need to include can
be described in a much shorter format than the output you would generate
from a really big word list. And if that's the case, it's obviously rather
futile... I assume there is linguistic research done in that sector though.
Danny Yoo usually knows these things...


--
Magnus Lycka (It's really Lyckå), magnus@thinkware.se
Thinkware AB, Sweden, www.thinkware.se
I code Python ~ The shortest path from thought to working program