I want to use this encoding https://github.com/vanangamudi/tace16-utf8-converter/blob/master/tace16.py for Tamil language text
As written, it sounds like you just want help. If so, this list is for proposals to change Python itself (including the standard library), and this should have been posted to python-list or to StackExchange.
If you do mean to propose this for the stdlib, it is highly unlikely to get in as proposed since the encoding commandeers private space in the BMP, which is a scarce resource. We *can* do that, but it's very likely that the general sentiment will be "do it in a PyPI module, then it *can't* cause anybody else any trouble." In principle it's not our job to "fix" Unicode. That's the work of the relevant national standards body for Tamil and the Unicode Consortium. (I am not authoritative, so if that's what you want, don't take my word for it. I just want you to be prepared for what I expect to be strong pushback, and what the argument will be.)
About the proposal:
If you are planning to use TACE16 as an interchange format, you don't need a codec; you just treat it as normal UTF-8 (or any other UTF, for that matter). Python does not care whether a character is standard or private, it just adds it to the str the codec is building.
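To make that concrete, here is a small sketch showing private-use characters going through the built-in UTF codecs untouched. The two code points are arbitrary picks from the BMP Private Use Area and are only stand-ins; I'm not claiming they are real TACE16 assignments.

```python
import unicodedata

# Two arbitrary code points from the BMP Private Use Area (U+E000..U+F8FF).
# They stand in for TACE16 characters; no real TACE16 assignment is implied.
pua_text = "\ue100\ue101"

# General category "Co" means "private use". Python stores such characters
# in a str exactly like any other character.
categories = [unicodedata.category(ch) for ch in pua_text]

# Round trip through two standard UTFs using only the built-in codecs.
utf8_bytes = pua_text.encode("utf-8")
utf16_bytes = pua_text.encode("utf-16-le")
round_trip_ok = (
    utf8_bytes.decode("utf-8") == pua_text
    and utf16_bytes.decode("utf-16-le") == pua_text
)

print(categories, len(utf8_bytes), len(utf16_bytes), round_trip_ok)
```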
If you propose to use the codec to translate standard Unicode to TACE16 as the internal format, the obvious (rough) idea would be to just plug the converter you have written into the stdlib's Unicode codecs as a post-processor that runs when there is a Unicode character in the (standard) Tamil block. This would then handle both the standard Unicode encoding for Tamil and TACE16 (because TACE16 characters would pass through the UTF-8 decoder unchanged, and the converter would ignore them).
You may want two separate codecs for output: one which produces TACE16 for you, and another which produces standard Unicode for anyone who doesn't have TACE16 capability.
Exactly how to do that is above my pay grade: it depends on how the postprocessor works, which depends on Tamil language knowledge that I don't have. Whether to rewrite the converter in C is up to you; it's possible to call Python from C.
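For what it's worth, the str-to-str post-processing step might look roughly like the sketch below. Everything in it is an assumption for illustration: the two-entry table is a toy, not the real TACE16 assignments, and the names `TO_PRIVATE`, `to_private` and `to_standard` are made up. The one point it tries to show is that multi-character sequences need to be matched longest first, and that text already in the private form passes through unchanged.

```python
import re

# Toy table, NOT the real TACE16 assignments: standard sequence -> private code point.
TO_PRIVATE = {
    "\u0b95": "\ue000",        # TAMIL LETTER KA
    "\u0b95\u0bbe": "\ue001",  # KA followed by TAMIL VOWEL SIGN AA
}
TO_STANDARD = {private: standard for standard, private in TO_PRIVATE.items()}

# Try the longest sequences first, so KA+AA is not split into KA plus a stray sign.
_PRIVATE_RE = re.compile(
    "|".join(re.escape(seq) for seq in sorted(TO_PRIVATE, key=len, reverse=True))
)
_STANDARD_RE = re.compile("|".join(re.escape(ch) for ch in TO_STANDARD))


def to_private(text):
    """Replace known standard sequences; leave everything else alone."""
    return _PRIVATE_RE.sub(lambda m: TO_PRIVATE[m.group()], text)


def to_standard(text):
    """Inverse of to_private for the code points in the toy table."""
    return _STANDARD_RE.sub(lambda m: TO_STANDARD[m.group()], text)


sample = "\u0b95\u0bbe\u0b95 abc"
converted = to_private(sample)
print(ascii(converted), to_standard(converted) == sample)
```

Because `to_private` only touches sequences in the table, feeding it text that is already converted is harmless, which is what lets one decoder accept both forms.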
Two basic questions,
- How do I approach writing a new text encoding codec for python and register it with the codec module.
Start here:
    /Users/steve/src/Python/cpython/Doc/library/codecs.rst
    /Users/steve/src/Python/cpython/Doc/c-api/codec.rst
To write them in C, follow the code in the files below.
Likely needed (I forgot where the Unicode codecs live; try codecs.[ch] first):
    /Users/steve/src/Python/cpython/Python/codecs.c
    /Users/steve/src/Python/cpython/Include/codecs.h
    /Users/steve/src/Python/cpython/Objects/stringlib/codecs.h
    /Users/steve/src/Python/cpython/Objects/unicodectype.c
    /Users/steve/src/Python/cpython/Lib/codecs.py
    /Users/steve/src/Python/cpython/Modules/_codecsmodule.c
Probably not needed:
    /Users/steve/src/Python/cpython/Modules/cjkcodecs
    /Users/steve/src/Python/cpython/Modules/clinic/_codecsmodule.c.h
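Before touching C, you can prototype in pure Python with `codecs.register`. The following is only a minimal sketch: `to_private` and `to_standard` are hypothetical stand-ins for your converter, the single mapping in them is a toy and not a real TACE16 assignment, and only the stateless encode/decode pair is provided. File I/O and source decoding would, as far as I know, also need the incremental and stream classes described in the codecs documentation.

```python
import codecs


# Hypothetical stand-ins for the real converter. The single mapping is a toy
# placeholder, not a real TACE16 assignment.
def to_private(text):
    return text.replace("\u0b95\u0bbe", "\ue001")


def to_standard(text):
    return text.replace("\ue001", "\u0b95\u0bbe")


def decode(data, errors="strict"):
    # Decode as ordinary UTF-8, then post-process the resulting str.
    text, consumed = codecs.utf_8_decode(data, errors, True)
    return to_private(text), consumed


def encode(text, errors="strict"):
    # Convert back to standard Unicode, then encode as ordinary UTF-8.
    data, _ = codecs.utf_8_encode(to_standard(text), errors)
    return data, len(text)


def search(name):
    # Depending on the Python version, hyphens in the requested name may or
    # may not already be normalized to underscores, so accept both spellings.
    if name.replace("-", "_") == "utf_8_tace16":
        return codecs.CodecInfo(encode=encode, decode=decode, name="utf-8-tace16")
    return None


codecs.register(search)

standard_bytes = "\u0b95\u0bbe".encode("utf-8")
internal = standard_bytes.decode("utf-8-tace16")
print(ascii(internal), internal.encode("utf-8-tace16") == standard_bytes)
```

In this sketch the registered encoder writes standard Unicode. If you want the private form on output instead, encoding the internal str with plain "utf-8" already does that, which would give you the two output paths without a second custom codec.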
- How would I convert utf-8 encoded pattern for regex into the custom codec so that the pattern and input string for re.match/search is consistent.
You don't. That's the point of the codec: you convert all text (including source program text) into an internal "abstract text" type (i.e., str), and then it "just works". Instead, you would read program text as utf-8-tace16 by placing a PEP 263 coding cookie in one of the first two lines of your program, like this:
# -*- encoding: utf-8-tace16 -*-
If you think that's ugly, read the PEP for alternative forms. If you want to avoid it entirely, I'm not sure it's possible, but python-list or StackExchange are better places to ask.
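To illustrate the "just works" part: once both the pattern and the input are str objects holding the same kind of code points, re neither knows nor cares that they are private-use. A small sketch, using an arbitrary Private Use Area range as a stand-in (not a real TACE16 block):

```python
import re

# Both the pattern and the subject are ordinary str objects. The range
# U+E100..U+E1FF is an arbitrary stand-in, not a real TACE16 block.
text = "abc\ue100\ue101def"
pattern = re.compile("[\ue100-\ue1ff]+")

match = pattern.search(text)
print(match.span(), ascii(match.group()))
```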