Hi,

I want to use this encoding[1] for Tamil language text because it is more consistent with the language's nature, and Unicode encoding severely damages(read more here[2]) the intrinsic features of the fusion of alphabets.

a specific example would be when dealing with regex. '^[சிகு]' is the intended expression for lines that starts with either 'சி' or 'கு' just like how in English '^[ab]' matches lines that start with either 'a' or 'b'. But since Unicode represents some of the eastern languages with multiple code points, '^[சிகு]' basically translates to  '^[ச,ி,க,ு]' (using the commas for clarity) சி -> ச,ி and கு -> க,ு . Running the expression over a few words in python, gives the results as shown in the attached image.

Note: expected results can be obtained by using this expression '^(சி|கு)' but this works for this specific case, but there should be a way to write expressions to match சிசிசிகுகுசிகு?

regex in tamil is not a python issue. it is a unicode issue. I suppose if I can encode Tamil text in TACE16 encoding, I can use regex directly over it since (as I understand) re module runs matching over bytes.

Two basic questions,

  1. How do I approach writing a new text encoding codec for python and register it with the codec module.
  2. How would I convert utf-8 encoded pattern for regex into the custom codec so that the pattern and input string for re.match/search is consistent.

Links:
[1] https://github.com/vanangamudi/tace16-utf8-converter/blob/master/tace16.py
[2] https://en.wikipedia.org/wiki/Tamil_All_Character_Encoding

--
Thanks,
வணங்காமுடி
(vanangamudi)