Extend unicodedata with a name/pattern/regex search for character entity references?

Sun Sep 4 00:02:12 EDT 2016

Thomas 'PointedEars' Lahn wrote:

> Veek. M wrote:
> 
>> https://mail.python.org/pipermail//python-ideas/2014-October/029630.htm
>> 
>> Wanted to know if the above link idea,
> 
> … which is 404-compliant; the Internet Archive does not have it either
> …
> 
>> had been implemented
> 
> Probably not.
> 
>> and if there's a module that accepts a pattern like 'cap' and give
>> you all the instances of unicode 'CAP' characters.
> 
> I do not know any.
> 
>>  ⋂ \bigcap
>>  ⊓ \sqcap
>>  ∩ \cap
>>  ♑ \capricornus
>>  ⪸ \succapprox
>>  ⪷ \precapprox
>> 
>> (above's from tex)
>> 
>> I found two useful modules in this regard: unicode_tex, unicodedata
>> but unicodedata is a builtin which does not do globs, regexs - so
>> it's kind of limiting in nature.
> 
> Quick hack:
> 
> #--------------------------------------------------------------------
> from unicode_tex import unicode_to_tex_map
> 
> for key, value \
> in filter(lambda item: "cap" in item[1], unicode_to_tex_map.items()):
>     print(key, value)
> #--------------------------------------------------------------------
> 
> (Optimizations are welcome.)
> 
> It is easy to come up with methods that take a globbing or a regular
> expression (globbing expressions can be turned into regular
> expressions easily) and returns, perhaps as a dictionary or list of
> tuples, only the matching entries.
> 
> Other than that I think you will have to turn the Unicode Character
> Database (which is available via HTTP as one huge text file; see the
> Python Tutorial on “Internet Access” for how to get it dynamically)
> into whatever form suits you for querying it.
>  
>> Would be nice if you could search html/xml character entity
>> references as well.
> 
> For what purpose?
> 
> Your posting is lacking a real name in the “From” header field.
> 

Ouch! Sorry for the bad link, Thomas. The post is titled '[Python-ideas] 
Extend unicodedata with a name search', and I suspect this updated link 
(http://code.activestate.com/lists/python-ideas/29504/) may work; if it 
doesn't, you could google the title.

I don't want to dump/replicate the existing Unicode data in module 
'unicodedata'.
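
For what it's worth, the names can be searched in place, without copying 
the database anywhere. A rough sketch, assuming Python 3: search_names 
is a made-up helper (not part of unicodedata) that simply walks every 
code point and asks unicodedata for its name, so it is slow but needs no 
extra data.

```python
import re
import sys
import unicodedata


def search_names(pattern, flags=re.IGNORECASE):
    """Yield (character, name) for every named character whose Unicode
    name matches the regular expression `pattern`.

    Hypothetical helper, not part of the unicodedata module.
    """
    regex = re.compile(pattern, flags)
    for codepoint in range(sys.maxunicode + 1):
        char = chr(codepoint)
        # The second argument is returned for characters without a name.
        name = unicodedata.name(char, '')
        if name and regex.search(name):
            yield char, name
```

Something like list(search_names(r'\bcap\b')) would then give the 
matches. Note that the Unicode names do not always agree with the TeX 
ones: ∩ is plain INTERSECTION there, so a search for 'cap' finds 
SQUARE CAP (⊓) but not ∩.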

Regarding purpose: I need this for hexchat. I IRC a lot, and I often 
want to ask a question involving math symbols. I've written some 
Python (included at the bottom) that translates
 \help filter_word
into a list of matching symbols and their names (serves as a memory 
jog). It also translates stuff like
 A \cap B \epsilon C
to
 A ∩ B ε C
but all this works only with a subset of TeX; it can't do complicated 
formulas. I wanted to extend it further. I don't think I shall be able 
to subscript integrals easily, but I could make better use of the 
available Unicode, which means making it more accessible (hence the 
pattern matching feature). HTML/XML entities provide another way of 
remembering stuff.
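
For the entity side, the standard library already ships the HTML5 table 
as html.entities.html5 (Python 3.3+), so a pattern search over it is 
short. A sketch only; search_entities is a name I made up, and matching 
case-insensitively is my choice (entity names themselves are 
case-sensitive, so 'cap' and 'Cap' are different characters):

```python
import re
from html.entities import html5


def search_entities(pattern, flags=re.IGNORECASE):
    """Return a sorted list of (entity name, replacement text) pairs
    whose entity name matches the regular expression `pattern`.

    Hypothetical helper, not part of html.entities.
    """
    regex = re.compile(pattern, flags)
    found = {}
    # html5 maps names such as 'cap;' (plus a few legacy names without
    # the trailing semicolon, such as 'amp') to their replacement text.
    for name, text in html5.items():
        bare = name.rstrip(';')
        if regex.search(bare):
            found[bare] = text
    return sorted(found.items())
```

search_entities('cap') would list cap, bigcap, sqcap and friends with 
their characters, much like the \help output of the script below.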

---------------------------
Regarding the name (From field), my name *is* Veek.M though I tend to 
shorten it to Vek.M on Google (I think Veek was taken or some such 
thing). Just to be clear, my parents call me something closely related 
to Veek that is NOT Beek or Peek or Squeak or Sneak and my official name 
is something really weird. Identity theft being what it is, I probably 
am lying anyhow about all this, but it sounds funny so :p

import hexchat
import re, unicode_tex, unicodedata

__module_name__ = 'Unicode'
__module_version__ = '0.1'
__module_description__ = ('Substitute \\whatever with Unicode char in '
                          'cmdline input')

#re_repl = unicodedata.lookup('N-ARY UNION')

def debug(*args):
    hexchat.prnt('#####{}#####'.format(*args))

def print_help(*args):
    hexchat.prnt('{}'.format(*args))

def send_message(word, word_eol, userdata):
    # Only act on the Return key (key value 65293).
    if word[0] != "65293":
        return

    msg = hexchat.get_info('inputbox')
    if msg is None:
        return

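    # '\help word' lists every known TeX command containing 'word'.
    help_match = re.match(r'(^\\help)\s+(\w+)', msg)
    if help_match:
        wanted = help_match.group(2)  # avoid shadowing the builtin filter()
        for key, value in unicode_tex.tex_to_unicode_map.items():
            if wanted in key:
                print_help(value + ' ' + key)
        hexchat.command("settext %s" % '')  # clear the input box
        return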

    def replace(match):
        # Unknown commands become 'err', as before.
        return unicode_tex.tex_to_unicode_map.get(match.group(0), 'err')

    # Substitute in a single pass, so that replacing \cap cannot clobber
    # the start of a longer command such as \capricornus.
    msg = re.sub(r'\\\w+', replace, msg)

    hexchat.command("settext %s" % msg)

hexchat.hook_print('Key Press', send_message)