Hi,
I want to use
this encoding[1] for Tamil language text because it is more consistent with the language's nature, and Unicode encoding severely damages(
read more here[2]) the intrinsic features of the fusion of alphabets.
a specific example would be when dealing with regex. '^[சிகு]' is the intended expression for lines that starts with
either 'சி' or 'கு' just like how in English '^[ab]' matches lines that
start with either 'a' or 'b'. But since Unicode represents some of the eastern languages with
multiple code points, '^[சிகு]' basically translates to '^[ச,ி,க,ு]' (using the commas for clarity) சி
-> ச,ி and கு -> க,ு . Running the expression over a few words in python, gives the results as shown in the attached image.
Note: expected results can be obtained by using this expression
'^(சி|கு)' but this works for this specific case, but there should be a
way to write expressions to match சிசிசிகுகுசிகு?
regex in tamil is not a python issue. it is a unicode issue. I suppose if I can encode Tamil text in TACE16 encoding, I can use regex directly over it since (as I understand) re module runs matching over bytes.
Two basic questions,
- How do I approach writing a new text encoding codec for python and register it with the codec module.
- How would I convert utf-8 encoded pattern for regex into the custom codec so that the pattern and input string for re.match/search is consistent.
Links:
--
Thanks,
வணங்காமுடி
(vanangamudi)