[Python-ideas] TACE16 text encoding for Tamil language

30 Apr 2021

Hi,

I want to use the TACE16 encoding[1]
https://github.com/vanangamudi/tace16-utf8-converter/blob/master/tace16.py
for Tamil text because it is more consistent with the nature of the
language. The Unicode encoding severely damages the intrinsic features of
the fused letters of the alphabet (read more here[2]
https://en.wikipedia.org/wiki/Tamil_All_Character_Encoding).

A specific example is regex. '^[சிகு]' is the intended expression for
lines that start with either 'சி' or 'கு', just as in English '^[ab]'
matches lines that start with either 'a' or 'b'. But since Unicode
represents many letters of some Eastern scripts with multiple code points,
'^[சிகு]' effectively becomes '^[ச,ி,க,ு]' (commas added for clarity),
because சி -> ச,ி and கு -> க,ு. Running the expression over a few words
in Python gives the results shown in the attached image.
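
In case the image does not come through, here is a minimal sketch that
shows the same effect. The sample words are my own illustration and are
not necessarily the ones in the attachment.

```python
import re

# Each of these syllables is two code points: a consonant followed by a
# dependent vowel sign.
print(len('சி'), [hex(ord(ch)) for ch in 'சி'])
print(len('கு'), [hex(ord(ch)) for ch in 'கு'])

# Intended meaning: "starts with சி or கு".
# Actual meaning: "starts with any one of the code points ச, ி, க, ு".
pattern = re.compile('^[சிகு]')

# Illustrative sample words (an assumption, not taken from the image).
words = ['சிலை', 'குடை', 'சாலை', 'கடல்', 'மரம்']

matches = [w for w in words if pattern.match(w)]
print(matches)
```

The words சாலை and கடல் are matched as well, because they begin with the
bare consonants ச and க, which is not what the expression was meant to say.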

Note: the expected results can be obtained with the expression '^(சி|கு)',
but that only covers this specific case. Shouldn't there be a more direct
way to write expressions that match something like சிசிசிகுகுசிகு?
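
For completeness, the alternation workaround can be stretched to cover a
repeated string like that one by wrapping it in a non-capturing group, but
it is more verbose than a character class. A rough sketch follows; the
helper name syllable_class is made up for this example.

```python
import re

def syllable_class(syllables):
    """Build a regex fragment that matches any one of the given syllables."""
    # Sort longest first so a longer syllable is tried before a shorter
    # one that happens to be its prefix.
    ordered = sorted(syllables, key=len, reverse=True)
    return '(?:' + '|'.join(re.escape(s) for s in ordered) + ')'

unit = syllable_class(['சி', 'கு'])

# One or more of the listed syllables, and nothing else.
pattern = re.compile('^' + unit + '+$')

print(bool(pattern.match('சிசிசிகுகுசிகு')))  # built only from சி and கு
print(bool(pattern.match('சிகா')))            # கா is not in the list
```

This works, but every "character class" has to be spelled out as a group of
alternatives, which is what I would like to avoid.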

Regex in Tamil is not a Python issue; it is a Unicode issue. I suppose that
if I can encode Tamil text in TACE16, I can run regex directly over it,
since (as I understand it) the re module runs matching over bytes.

Two basic questions:

   1. How do I approach writing a new text encoding codec for Python and
   registering it with the codecs module?
   2. How would I convert a UTF-8 encoded regex pattern into the custom
   codec so that the pattern and the input string for re.match/re.search
   are consistent?
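
To make the two questions concrete, here is a rough sketch of what I have
in mind. Everything in it is an assumption for illustration: the table is a
made-up toy mapping into the private use area and NOT the real TACE16
assignments, and the names toy_tace, to_toy and from_toy are placeholders.
The only library calls used are codecs.register and codecs.CodecInfo.

```python
import codecs
import re

# Hypothetical toy table, NOT the real TACE16 code points.
# Each syllable (one or two Unicode code points) gets one private-use
# code point.
TOY_TABLE = {
    'ச': '\ue000',
    'சி': '\ue001',
    'சா': '\ue002',
    'க': '\ue010',
    'கு': '\ue011',
}
REVERSE = {v: k for k, v in TOY_TABLE.items()}

# Longest syllables first, so 'சி' is preferred over the bare 'ச'.
_keys = sorted(TOY_TABLE, key=len, reverse=True)
_SYLLABLE = re.compile('|'.join(re.escape(k) for k in _keys))

def to_toy(text):
    """Replace every known syllable with its single code point."""
    return _SYLLABLE.sub(lambda m: TOY_TABLE[m.group()], text)

def from_toy(text):
    """Inverse of to_toy for the code points in the toy table."""
    return ''.join(REVERSE.get(ch, ch) for ch in text)

def _encode(text, errors='strict'):
    # str -> bytes, one 16-bit unit per mapped syllable.
    return to_toy(text).encode('utf-16-be', errors), len(text)

def _decode(data, errors='strict'):
    return from_toy(bytes(data).decode('utf-16-be', errors)), len(data)

def _search(name):
    # Codec search function: return CodecInfo for our name, else None.
    if name == 'toy_tace':
        return codecs.CodecInfo(name='toy_tace',
                                encode=_encode, decode=_decode)
    return None

codecs.register(_search)

# Question 1: the registered codec round-trips.
encoded = 'சிகு'.encode('toy_tace')
assert encoded.decode('toy_tace') == 'சிகு'

# Question 2: pass the pattern and the text through the same table.
pattern = re.compile(to_toy('^[சிகு]'))
for word in ['சிசிகு', 'சாகு', 'குசி']:
    print(word, bool(pattern.match(to_toy(word))))
```

In this sketch the pattern is translated as a str rather than as bytes,
because a character class in a bytes pattern lists single bytes, not
16-bit units. With one code point per syllable, the class behaves the way I
intended: சாகு is no longer matched.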

Links:
[1]
https://github.com/vanangamudi/tace16-utf8-converter/blob/master/tace16.py
[2] https://en.wikipedia.org/wiki/Tamil_All_Character_Encoding

-- 
Thanks,
வணங்காமுடி
(vanangamudi)

பா. மு. செல்வக்குமார்