Where to contribute Unicode General Category encoding/decoding

Pander Musubi pander.musubi at gmail.com
Fri Dec 14 17:22:31 CET 2012


On Friday, December 14, 2012 2:07:51 PM UTC+1, Pander Musubi wrote:
> On Friday, December 14, 2012 1:06:23 AM UTC+1, Steven D'Aprano wrote:
> > On Thu, 13 Dec 2012 07:30:57 -0800, Pander Musubi wrote:
> > > I was expecting PyPI. Here is the code, please advise on where to submit
> > > it:
> > >
> > >   http://pastebin.com/dbzeasyq
> >
> > If anywhere, either a third-party module, or the unicodedata standard
> > library module.
> >
> > Some unanswered questions:
> >
> > - when would somebody need this function?
>
> When working with Unicode metadata, see below.
>
> > - why is it called "decodeUnicodeGeneralCategory" when it
> >   doesn't seem to have anything to do with decoding?
>
> It is actually a simple lookup table (LUT). I like your improvements below.
>
> > - why is the parameter "sortable" called sortable, when it
> >   doesn't seem to have anything to do with sorting?
>
> The values returned are alphabetically sortable.
>
> > If this is useful at all, it would be more useful to just expose the data
> > as a dict, and forget about an unnecessary wrapper function:
> >
> > from collections import namedtuple
> >
> > r = namedtuple("record", "other name desc")  # better field names needed!
> >
> > GC = {
> >     'C' : r('Other', 'Other', 'Cc | Cf | Cn | Co | Cs'),
> >     'Cc': r('Control', 'Control',
> >             'a C0 or C1 control code'), # a.k.a. cntrl
> >     'Cf': r('Format', 'Format', 'a format control character'),
> >     'Cn': r('Unassigned', 'Unassigned',
> >             'a reserved unassigned code point or a noncharacter'),
> >     'Co': r('Private Use', 'Private_Use', 'a private-use character'),
> >     'Cs': r('Surrogate', 'Surrogate', 'a surrogate code point'),
> >     'L' : r('Letter', 'Letter', 'Ll | Lm | Lo | Lt | Lu'),
> >     'LC': r('Letter, Cased', 'Cased_Letter', 'Ll | Lt | Lu'),
> >     'Ll': r('Letter, Lowercase', 'Lowercase_Letter',
> >             'a lowercase letter'),
> >     'Lm': r('Letter, Modifier', 'Modifier_Letter', 'a modifier letter'),
> >     'Lo': r('Letter, Other', 'Other_Letter',
> >             'other letters, including syllables and ideographs'),
> >     'Lt': r('Letter, Titlecase', 'Titlecase_Letter',
> >             'a digraphic character, with first part uppercase'),
> >     'Lu': r('Letter, Uppercase', 'Uppercase_Letter',
> >             'an uppercase letter'),
> >     'M' : r('Mark', 'Mark', 'Mc | Me | Mn '), # a.k.a. Combining_Mark
> >     'Mc': r('Mark, Spacing', 'Spacing_Mark',
> >             'a spacing combining mark (positive advance width)'),
> >     'Me': r('Mark, Enclosing', 'Enclosing_Mark',
> >             'an enclosing combining mark'),
> >     'Mn': r('Mark, Nonspacing', 'Nonspacing_Mark',
> >             'a nonspacing combining mark (zero advance width)'),
> >     'N' : r('Number', 'Number', 'Nd | Nl | No'),
> >     'Nd': r('Number, Decimal', 'Decimal_Number',
> >             'a decimal digit'), # a.k.a. digit
> >     'Nl': r('Number, Letter', 'Letter_Number',
> >             'a letterlike numeric character'),
> >     'No': r('Number, Other', 'Other_Number',
> >             'a numeric character of other type'),
> >     'P' : r('Punctuation', 'Punctuation',
> >             'Pc | Pd | Pe | Pf | Pi | Po | Ps'), # a.k.a. punct
> >     'Pc': r('Punctuation, Connector', 'Connector_Punctuation',
> >             'a connecting punctuation mark, like a tie'),
> >     'Pd': r('Punctuation, Dash', 'Dash_Punctuation',
> >             'a dash or hyphen punctuation mark'),
> >     'Pe': r('Punctuation, Close', 'Close_Punctuation',
> >             'a closing punctuation mark (of a pair)'),
> >     'Pf': r('Punctuation, Final', 'Final_Punctuation',
> >             'a final quotation mark'),
> >     'Pi': r('Punctuation, Initial', 'Initial_Punctuation',
> >             'an initial quotation mark'),
> >     'Po': r('Punctuation, Other', 'Other_Punctuation',
> >             'a punctuation mark of other type'),
> >     'Ps': r('Punctuation, Open', 'Open_Punctuation',
> >             'an opening punctuation mark (of a pair)'),
> >     'S' : r('Symbol', 'Symbol', 'Sc | Sk | Sm | So'),
> >     'Sc': r('Symbol, Currency', 'Currency_Symbol', 'a currency sign'),
> >     'Sk': r('Symbol, Modifier', 'Modifier_Symbol',
> >             'a non-letterlike modifier symbol'),
> >     'Sm': r('Symbol, Math', 'Math_Symbol',
> >             'a symbol of mathematical use'),
> >     'So': r('Symbol, Other', 'Other_Symbol', 'a symbol of other type'),
> >     'Z' : r('Separator', 'Separator', 'Zl | Zp | Zs'),
> >     'Zl': r('Separator, Line', 'Line_Separator',
> >             'U+2028 LINE SEPARATOR only'),
> >     'Zp': r('Separator, Paragraph', 'Paragraph_Separator',
> >             'U+2029 PARAGRAPH SEPARATOR only'),
> >     'Zs': r('Separator, Space', 'Space_Separator',
> >             'a space character (of various non-zero widths)'),
> >     }
> >
> > del r
> >
> > Usage is then trivially the same as normal dict and attribute access:
> >
> > py> GC['Ps'].desc
> > 'an opening punctuation mark (of a pair)'
>
> Thank you for the improvements. I have some more dicts along these lines, for data such as:
>
>   http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>
> where this general category is being used. This information is useful when handling Unicode metadata.
>
> I think I will approach both
>
>   http://pypi.python.org/pypi/unicodeblocks/
>
> and
>
>   http://pypi.python.org/pypi/unicodescript/
>
> to see who will adopt this.
>
> Perhaps it might be in their mutual interest to merge their packages into, e.g., unicodemetadata or something similar. Extra ideas on this are still welcome.
>
> Thanks for all your help,
>
> Pander
>
> > --
> > Steven

Ah, it will become a feature request for http://docs.python.org/3/library/unicodedata.html
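
For context, a minimal sketch of what such a feature might cover. Only unicodedata.category() is an existing standard-library call; the GC_LONG_NAMES table and the describe() helper are hypothetical names used for illustration, and the table is a small subset of the long names quoted above:

```python
import unicodedata

# Hypothetical subset of the long-name table discussed in this thread;
# the standard library does not ship such a mapping.
GC_LONG_NAMES = {
    'Lu': 'Uppercase_Letter',
    'Ll': 'Lowercase_Letter',
    'Nd': 'Decimal_Number',
    'Ps': 'Open_Punctuation',
    'Zs': 'Space_Separator',
}


def describe(char):
    """Return (abbreviation, long name) for a single character."""
    abbr = unicodedata.category(char)  # existing stdlib call, e.g. 'Lu'
    # Fall back to the abbreviation when the subset has no entry for it.
    return abbr, GC_LONG_NAMES.get(abbr, abbr)


for ch in ('A', 'a', '7', '(', ' '):
    print(repr(ch), *describe(ch))
```

Today unicodedata only returns the two-letter abbreviation, so the long name and description still have to come from a separate table like the one above.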


