I want to use the TACE16 encoding (https://github.com/vanangamudi/tace16-utf8-converter/blob/master/tace16.py) for Tamil text because it is more consistent with the nature of the language. Unicode severely damages the intrinsic features of the fusion of letters (read more at https://en.wikipedia.org/wiki/Tamil_All_Character_Encoding).
A specific example is regex. *'^[சிகு]'* is intended to match lines that start with either 'சி' or 'கு', just as '^[ab]' in English matches lines that start with either 'a' or 'b'. But since Unicode represents such letters in some Eastern languages with multiple code points, '^[சிகு]' effectively becomes '^[ச,ி,க,ு]' (commas added for clarity), because சி -> ச,ி and கு -> க,ு. Running the expression over a few words in Python gives the results shown in the attached image.
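To make the problem reproducible without the image, here is a small sketch. The word list is my own choice for illustration and is not taken from the attached image.

```python
import re
import unicodedata

# Each syllable is stored as two code points: a consonant plus a vowel sign.
for ch in "சி":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The character class therefore contains four separate code points.
pattern = re.compile(r"^[சிகு]")

# Illustrative words (my own picks): the first two start with the intended
# syllables, the next two only start with the bare consonants ச and க.
words = ["சிலை", "குடம்", "சட்டம்", "கடல்", "மரம்"]
matches = [w for w in words if pattern.search(w)]
print(matches)
```

The words that begin with a bare ச or க are matched as well, which is not what I intended.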
Note: the expected results can be obtained with '^(சி|கு)', but that only works for this specific case. Shouldn't there be a way to write an expression that matches சிசிசிகுகுசிகு?
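For reference, the alternation workaround can probably be stretched to repeated syllables by wrapping it in a non-capturing group with a quantifier. This is only a sketch of the workaround, and it still means listing every syllable explicitly instead of using a character class.

```python
import re

# Non-capturing group of whole syllables, repeated one or more times.
syllables = re.compile(r"(?:சி|கு)+")

# fullmatch requires the entire string to consist of the listed syllables.
print(bool(syllables.fullmatch("சிசிசிகுகுசிகு")))

# A trailing bare consonant க is not one of the listed syllables.
print(bool(syllables.fullmatch("சிக")))
```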
Regex in Tamil is not a Python issue; it is a Unicode issue. I suppose that if I can encode Tamil text in TACE16, I can use regex directly over it, since (as I understand it) the re module runs matching over bytes.
Two basic questions:

1. How do I approach writing a new text encoding codec for Python and registering it with the codecs module?
2. How would I convert a UTF-8 encoded regex pattern into the custom codec, so that the pattern and the input string for re.match/search are consistent?
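Here is a rough sketch of what I have in mind, using `codecs.register` and `codecs.CodecInfo`. The codec name `toy_tace`, the two-entry table, and its private-use code values are all made up for illustration; they are NOT the real TACE16 assignments. One caveat I ran into: with a `bytes` pattern, a character class matches single bytes, so the sketch reinterprets the 16-bit units as a `str` (via `utf-16-be`) before matching.

```python
import codecs
import re

# Hypothetical table: NOT the real TACE16 values, just private-use code units.
_TOY_TABLE = {"சி": 0xE100, "கு": 0xE101}
_REVERSE = {unit: syl for syl, unit in _TOY_TABLE.items()}

def toy_encode(text, errors="strict"):
    """Encode str to bytes, one big-endian 16-bit unit per syllable or char."""
    out = bytearray()
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in _TOY_TABLE:
            out += _TOY_TABLE[pair].to_bytes(2, "big")
            i += 2
            continue
        cp = ord(text[i])
        if cp > 0xFFFF:
            raise UnicodeEncodeError("toy_tace", text, i, i + 1,
                                     "code point outside the BMP")
        out += cp.to_bytes(2, "big")
        i += 1
    return bytes(out), len(text)

def toy_decode(data, errors="strict"):
    """Decode bytes back to ordinary Unicode text."""
    data = bytes(data)
    chars = []
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        chars.append(_REVERSE.get(unit, chr(unit)))
    return "".join(chars), len(data)

def _search(name):
    # Called by the codecs machinery with the normalized encoding name.
    if name == "toy_tace":
        return codecs.CodecInfo(name="toy_tace",
                                encode=toy_encode, decode=toy_decode)
    return None

codecs.register(_search)

def to_units(text):
    """Return a str in which each table syllable is a single code point."""
    return text.encode("toy_tace").decode("utf-16-be")

# Convert both the pattern and the input the same way before matching.
pattern = re.compile(to_units("^[சிகு]"))
print(bool(pattern.search(to_units("சிலை"))))
print(bool(pattern.search(to_units("சட்டம்"))))
```

If this is the right direction, the remaining work would be replacing the toy table with the full TACE16 mapping from the linked converter.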