TACE16 text encoding for Tamil language
Hi,

I want to use this encoding <https://github.com/vanangamudi/tace16-utf8-converter/blob/master/tace16.py>[1] for Tamil language text because it is more consistent with the nature of the language. The Unicode encoding severely damages (read more here[2] <https://en.wikipedia.org/wiki/Tamil_All_Character_Encoding>) the intrinsic features of the fusion of letters.

A specific example arises when dealing with regex. *'^[சிகு]'* is the intended expression for lines that start with either 'சி' or 'கு', just as in English '^[ab]' matches lines that start with either 'a' or 'b'. But since Unicode represents the letters of some Eastern languages with multiple code points, '^[சிகு]' effectively translates to '^[ச,ி,க,ு]' (commas added for clarity), because சி -> ச,ி and கு -> க,ு. Running the expression over a few words in Python gives the results shown in the attached image.

Note: the expected results can be obtained with the expression '^(சி|கு)', but that works only for this specific case; there should be a way to write expressions to match சிசிசிகுகுசிகு?

Regex in Tamil is not a Python issue; it is a Unicode issue. I suppose that if I can encode Tamil text in the TACE16 encoding, I can use regex directly over it, since (as I understand it) the re module runs matching over bytes.

Two basic questions:
1. How do I approach writing a new text encoding codec for Python and registering it with the codecs module?
2. How would I convert a UTF-8 encoded regex pattern into the custom codec so that the pattern and the input string for re.match/search are consistent?

Links:
[1] https://github.com/vanangamudi/tace16-utf8-converter/blob/master/tace16.py
[2] https://en.wikipedia.org/wiki/Tamil_All_Character_Encoding

--
Thanks,
வணங்காமுடி (vanangamudi)
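For readers without the attached image, the behaviour described in the question can be reproduced with a small sketch like the one below. The word list is made up for illustration and is not taken from the original post. It also shows that a repeated non-capturing group is one way to match a run such as சிசிசிகுகுசிகு in standard Unicode text, although that still spells out each syllable rather than using a character class.

```python
import re

# Illustrative word list (an assumption, not from the original post).
words = ["சிறுவன்", "குதிரை", "சட்டை", "கல்", "மரம்"]

# A character class works per code point, so this behaves like ^[ச ி க ு]:
# it accepts any word starting with bare ச or க as well.
char_class = re.compile(r"^[சிகு]")

# Alternation compares whole two-code-point sequences instead.
alternation = re.compile(r"^(?:சி|கு)")

class_hits = [w for w in words if char_class.match(w)]
alt_hits = [w for w in words if alternation.match(w)]

# A repeated group handles runs of the two syllables.
run_text = "சி" * 3 + "கு" * 2 + "சி" + "கு"
run_match = re.fullmatch(r"(?:சி|கு)+", run_text)

print(class_hits)   # includes சட்டை and கல், which were not intended
print(alt_hits)     # only the words that start with சி or கு
print(run_match is not None)
```

Note that str patterns in the re module are matched over code points, not bytes, which is why the class splits சி into ச and ி.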
You wrote:
I want to use this encoding <https://github.com/vanangamudi/tace16-utf8-converter/blob/master/tace16.py> for Tamil language text
As written, it sounds like you just want help. If so, this list is for proposals to change Python itself (including the standard library), and this should have been posted to python-list or to StackExchange.

If you do mean to propose this for the stdlib, it is highly unlikely to get in as proposed since the encoding commandeers private space in the BMP, which is a scarce resource. We *can* do that, but it's very likely that the general sentiment will be "do it in a PyPI module, then it *can't* cause anybody else any trouble." In principle it's not our job to "fix" Unicode. That's the work of the relevant national standards body for Tamil and the Unicode Consortium. (I am not authoritative, so if that's what you want, don't take my word for it. I just want you to be prepared for what I expect to be strong pushback, and what the argument will be.)

About the proposal: If you are planning to use TACE16 as an interchange format, you don't need a codec; you just treat it as normal UTF-8 (or any other UTF, for that matter). Python does not care whether a character is standard or private, it just adds it to the str the codec is building.

If you propose to use the codec to translate standard Unicode to TACE16 as the internal format, the obvious (rough) idea would be to just plug the converter you have written into the stdlib's Unicode codecs as a post-processor when there is a Unicode character in the (standard) Tamil block. This would then handle both the standard Unicode encoding for Tamil, as well as TACE16 (because it would just pass through the UTF-8 part, and the converter would ignore it).

You may want two separate codecs for output: one which produces TACE16 for you, and another which produces standard Unicode for anyone who doesn't have TACE16 capability. Exactly how to do that is above my pay grade, it depends on how the postprocessor works, which depends on Tamil language knowledge that I don't have.
Whether to rewrite the converter in C is up to you; it's possible to call Python from C.
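To illustrate the remark above that Python does not care whether a character is standard or private, here is a minimal sketch. U+E100 is an arbitrary Private Use Area code point picked for the example; it is not claimed to be an actual TACE16 assignment.

```python
import unicodedata

# Arbitrary private-use code point, chosen only for illustration.
pua_char = "\ue100"
text = "abc" + pua_char + "தமிழ்"

# The stock UTF-8 codec round-trips private-use characters unchanged.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

# "Co" is the Unicode general category for private-use characters.
category = unicodedata.category(pua_char)

print(decoded == text, category, len(pua_char.encode("utf-8")))
```

So text that is already stored with private-use code points needs no special codec to be read into a str; only translation to and from the standard Tamil block does.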
Two basic questions,
1. How do I approach writing a new text encoding codec for python and register it with the codec module.
Start here:
/Users/steve/src/Python/cpython/Doc/library/codecs.rst
/Users/steve/src/Python/cpython/Doc/c-api/codec.rst

To write them in C, follow the code in

Likely needed (forgot where the Unicode codecs live, try codecs.[ch] first):
/Users/steve/src/Python/cpython/Python/codecs.c
/Users/steve/src/Python/cpython/Include/codecs.h
/Users/steve/src/Python/cpython/Objects/stringlib/codecs.h
/Users/steve/src/Python/cpython/Objects/unicodectype.c
/Users/steve/src/Python/cpython/Lib/codecs.py
/Users/steve/src/Python/cpython/Modules/_codecsmodule.c

Probably not needed:
/Users/steve/src/Python/cpython/Modules/cjkcodecs
/Users/steve/src/Python/cpython/Modules/clinic/_codecsmodule.c.h
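For the pure-Python route, registration with the codecs module might look roughly like the sketch below. Everything named here is hypothetical: the codec name "utf-8-toyfused", the two-entry mapping, and the private-use code points are placeholders and not the real TACE16 table. The real converter would take the place of the simple string replacement.

```python
import codecs

# HYPOTHETICAL mapping: placeholder private-use code points,
# not the real TACE16 assignments.
FUSED = {"சி": "\ue100", "கு": "\ue101"}
UNFUSED = {single: seq for seq, single in FUSED.items()}

CODEC_NAME = "utf-8-toyfused"  # hypothetical codec name


def _translate(text, table):
    """Replace every key of table found in text with its value."""
    for src, dst in table.items():
        text = text.replace(src, dst)
    return text


def toy_encode(text, errors="strict"):
    # str -> bytes: expand placeholders to standard Unicode, then UTF-8.
    data = _translate(text, UNFUSED).encode("utf-8", errors)
    return data, len(text)


def toy_decode(data, errors="strict"):
    # bytes -> str: decode UTF-8, then fuse known syllables.
    raw = bytes(data)
    text = raw.decode("utf-8", errors)
    return _translate(text, FUSED), len(raw)


def _search(name):
    # Lookup names arrive lowercased; newer Python versions also turn
    # hyphens into underscores, so normalize before comparing.
    if name.replace("-", "_") == "utf_8_toyfused":
        return codecs.CodecInfo(
            encode=toy_encode, decode=toy_decode, name=CODEC_NAME
        )
    return None


codecs.register(_search)

fused = "சிகு".encode("utf-8").decode(CODEC_NAME)
round_trip = fused.encode(CODEC_NAME).decode("utf-8")
print(len(fused), round_trip)
```

This only covers the stateless bytes.decode / str.encode path. As far as I know, using such a codec with open() or as a source encoding also needs incremental and stream classes in the CodecInfo, which the documents listed above describe.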
2. How would I convert utf-8 encoded pattern for regex into the custom codec so that the pattern and input string for re.match/search is consistent.
You don't. That's the point of the codec: you convert all text (including source program text) into an internal "abstract text" type (ie, str), and then it "just works".

Instead, you would read program text as utf-8-tace16 by placing a PEP 263 coding cookie in one of the first two lines of your program, like this:

    # -*- encoding: utf-8-tace16 -*-

If you think that's ugly, read the PEP for alternative forms. If you want to avoid it entirely, I'm not sure it's possible, but python-list or StackExchange are better places to ask.

Regards,
Steve
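The idea that pattern and input "just work" once both are in the same internal form can be sketched without a coding cookie by running the same conversion over both explicitly. This is a simplification of what the answer describes, and the mapping below is again a hypothetical two-entry placeholder rather than the real TACE16 table.

```python
import re

# HYPOTHETICAL placeholder mapping, not the real TACE16 assignments.
FUSED = {"சி": "\ue100", "கு": "\ue101"}


def fuse(text):
    """Replace each known two-code-point syllable with one code point."""
    for seq, single in FUSED.items():
        text = text.replace(seq, single)
    return text


# The same conversion is applied to the pattern and to the input, so the
# character class now holds two single code points, one per syllable.
pattern = re.compile(fuse(r"^[சிகு]+$"))

run_text = "சி" * 3 + "கு" * 2 + "சி" + "கு"
matches_run = bool(pattern.match(fuse(run_text)))
matches_bare = bool(pattern.match(fuse("சட்டை")))

print(matches_run, matches_bare)
```

With a registered codec and a coding cookie, the conversion of the pattern would happen when the source file is read, instead of through an explicit fuse() call.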
On Sat, May 1, 2021 at 11:17 AM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Start here:
/Users/steve/src/Python/cpython/Doc/library/codecs.rst
/Users/steve/src/Python/cpython/Doc/c-api/codec.rst

To write them in C, follow the code in

Likely needed (forgot where the Unicode codecs live, try codecs.[ch] first):
/Users/steve/src/Python/cpython/Python/codecs.c
/Users/steve/src/Python/cpython/Include/codecs.h
/Users/steve/src/Python/cpython/Objects/stringlib/codecs.h
/Users/steve/src/Python/cpython/Objects/unicodectype.c
/Users/steve/src/Python/cpython/Lib/codecs.py
/Users/steve/src/Python/cpython/Modules/_codecsmodule.c

Probably not needed:
/Users/steve/src/Python/cpython/Modules/cjkcodecs
/Users/steve/src/Python/cpython/Modules/clinic/_codecsmodule.c.h
I assume the "cpython" part of these paths here is your local clone of the CPython GitHub repo? (Otherwise these local filepaths from your computer don't make sense.)
Jonathan Goble writes:
I assume the "cpython" part of these paths here is your local clone of the CPython GitHub repo? (Otherwise these local filepaths from your computer don't make sense.)
Thanks for catching that! Sorry, I was concentrating on stifling irrelevant Unicode politics. :-)

You need a local clone of the GitHub repo, and the various possibly relevant files are in

Doc/library/codecs.rst (these two are available online)
Doc/c-api/codec.rst
Lib/codecs.py
Python/codecs.c
Include/codecs.h
Objects/stringlib/codecs.h
Objects/unicodectype.c
Modules/_codecsmodule.c
Modules/cjkcodecs
Modules/clinic/_codecsmodule.c.h

Steve
On Sun, May 2, 2021 at 1:47 AM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Jonathan Goble writes:
I assume the "cpython" part of these paths here is your local clone of the CPython GitHub repo? (Otherwise these local filepaths from your computer don't make sense.)
Thanks for catching that!
Sorry, I was concentrating on stifling irrelevant Unicode politics. :-) You need a local clone of the GitHub repo, and the various possibly relevant files are in
Doc/library/codecs.rst (these two are available online)
Doc/c-api/codec.rst
Lib/codecs.py
Python/codecs.c
Include/codecs.h
Objects/stringlib/codecs.h
Objects/unicodectype.c
Modules/_codecsmodule.c
Modules/cjkcodecs
Modules/clinic/_codecsmodule.c.h
And for the record and OP's benefit, all of these files are also available on GitHub here for viewing purposes: https://github.com/python/cpython
participants (3): Jonathan Goble, Stephen J. Turnbull, பா. மு. செல்வக்குமார்